The objective of this project is to analyze the Hitters dataset using statistical and machine learning techniques in order to predict the salary level of Major League Baseball players. The main focus is on applying and evaluating a multinomial logistic regression model, using a wide set of performance statistics as predictors. The salary variable, originally quantitative, is transformed into a categorical variable through salary tertiles. By implementing proper data preprocessing, visualization, model fitting, and cross-validation strategies, we aim to identify the most accurate and interpretable model. The final outcome of this work is an evaluation of the predictive power of the full model versus reduced models, along with insight into the key performance metrics that drive salary differences in professional baseball.
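The tertile transformation itself is not shown in this chunk. A minimal sketch of how it might be done, assuming cut() on empirical tertile breaks and using a toy salary vector in place of Hitters$Salary:

```r
# Hedged sketch (not the project's exact code): convert a quantitative
# salary into a 3-level categorical variable via its empirical tertiles.
set.seed(1)
salary <- runif(30, min = 100, max = 2000)  # toy stand-in for Hitters$Salary

breaks <- quantile(salary, probs = c(0, 1/3, 2/3, 1), na.rm = TRUE)
salary_level <- cut(salary, breaks = breaks,
                    labels = c("Low", "Medium", "High"),
                    include.lowest = TRUE)
table(salary_level)  # roughly equal counts per tertile
```

With include.lowest = TRUE the minimum salary is not dropped, and the three classes are balanced by construction, which is convenient for the multinomial model.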
library(tidyverse)
library(nnet)
library(tidyr)
library(ggplot2)
library(dplyr)
library(knitr)
library(corrplot)
library(ggcorrplot)
library(DT)
library(skimr)
library(tibble)
library(e1071)
library(patchwork)
library(car)
library(ggthemes)
library(scales)
library(viridis)
library(ggrepel)
library(glmnet)
library(reshape2)
library(plotly)
Hitters <- read.csv("Hitters.csv")
In this section, we examine the overall structure of the Hitters
dataset. The code creates a summary table showing the variable names,
their data types, and sample values using the
tibble,
sapply, and
head functions. This helps quickly
understand what kind of data we’re working with. Additionally, we use
the skim() function from the
skimr package to produce an in-depth
summary of the dataset, including number of rows, column types, missing
values, and basic statistics such as mean, standard deviation, and
quartiles for each variable.
structure_table <- tibble(
Variable = names(Hitters),
Type = sapply(Hitters, function(x) class(x)[1]),
Example_Values = sapply(Hitters, function(x) paste(head(x, 3), collapse = ", "))
)
knitr::kable(structure_table, caption = "Structure of the Hitters Dataset")
| Variable | Type | Example_Values |
|---|---|---|
| AtBat | integer | 293, 315, 479 |
| Hits | integer | 66, 81, 130 |
| HmRun | integer | 1, 7, 18 |
| Runs | integer | 30, 24, 66 |
| RBI | integer | 29, 38, 72 |
| Walks | integer | 14, 39, 76 |
| Years | integer | 1, 14, 3 |
| CAtBat | integer | 293, 3449, 1624 |
| CHits | integer | 66, 835, 457 |
| CHmRun | integer | 1, 69, 63 |
| CRuns | integer | 30, 321, 224 |
| CRBI | integer | 29, 414, 266 |
| CWalks | integer | 14, 375, 263 |
| League | character | A, N, A |
| Division | character | E, W, W |
| PutOuts | integer | 446, 632, 880 |
| Assists | integer | 33, 43, 82 |
| Errors | integer | 20, 10, 14 |
| Salary | numeric | NA, 475, 480 |
| NewLeague | character | A, N, A |
skim(Hitters)
| Name | Hitters |
| Number of rows | 317 |
| Number of columns | 20 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 17 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| League | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
| Division | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
| NewLeague | 0 | 1 | 1 | 1 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| AtBat | 0 | 1.00 | 381.23 | 153.35 | 16.0 | 256 | 379 | 512 | 687 | ▁▇▇▇▅ |
| Hits | 0 | 1.00 | 101.14 | 46.49 | 1.0 | 64 | 96 | 137 | 238 | ▂▇▆▃▁ |
| HmRun | 0 | 1.00 | 10.75 | 8.74 | 0.0 | 4 | 8 | 16 | 40 | ▇▃▂▁▁ |
| Runs | 0 | 1.00 | 50.89 | 26.02 | 0.0 | 30 | 48 | 69 | 130 | ▅▇▆▃▁ |
| RBI | 0 | 1.00 | 47.95 | 26.07 | 0.0 | 28 | 44 | 63 | 121 | ▃▇▃▃▁ |
| Walks | 0 | 1.00 | 38.51 | 21.33 | 0.0 | 22 | 35 | 53 | 105 | ▅▇▅▂▁ |
| Years | 0 | 1.00 | 7.43 | 4.93 | 1.0 | 4 | 6 | 11 | 24 | ▇▅▂▂▁ |
| CAtBat | 0 | 1.00 | 2646.22 | 2327.24 | 19.0 | 822 | 1928 | 3919 | 14053 | ▇▃▂▁▁ |
| CHits | 0 | 1.00 | 717.31 | 655.93 | 4.0 | 209 | 506 | 1051 | 4256 | ▇▃▁▁▁ |
| CHmRun | 0 | 1.00 | 68.94 | 86.03 | 0.0 | 14 | 37 | 90 | 548 | ▇▁▁▁▁ |
| CRuns | 0 | 1.00 | 358.15 | 334.22 | 1.0 | 101 | 247 | 518 | 2165 | ▇▂▁▁▁ |
| CRBI | 0 | 1.00 | 328.83 | 332.82 | 0.0 | 91 | 219 | 421 | 1659 | ▇▂▁▁▁ |
| CWalks | 0 | 1.00 | 258.01 | 264.74 | 0.0 | 68 | 170 | 337 | 1566 | ▇▂▁▁▁ |
| PutOuts | 0 | 1.00 | 290.54 | 282.31 | 0.0 | 109 | 212 | 325 | 1378 | ▇▃▁▁▁ |
| Assists | 0 | 1.00 | 107.94 | 137.44 | 0.0 | 7 | 40 | 166 | 492 | ▇▂▁▁▁ |
| Errors | 0 | 1.00 | 8.09 | 6.39 | 0.0 | 3 | 6 | 11 | 32 | ▇▅▂▁▁ |
| Salary | 58 | 0.82 | 532.81 | 451.64 | 67.5 | 190 | 420 | 750 | 2460 | ▇▃▁▁▁ |
Interpretation
AtBat has a mean of around 381, while career statistics like CAtBat can reach values as high as 14,053, and the skim histograms show that these career totals are strongly right-skewed.
cat_vars <- Hitters %>% select_if(is.character)
plots <- lapply(names(cat_vars), function(var) {
ggplot(Hitters, aes(x = .data[[var]])) +
geom_bar(fill = "#914777", color = "black") +
labs(title = paste("Barplot of", var),
x = var, y = "Count") +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(hjust = 0.5, face = "bold"),
axis.text.x = element_text(angle = 0)
)
})
wrap_plots(plots, nrow = 1)
Interpretation
The barplots show the distributions of the three categorical variables: League, Division, and NewLeague. Most players belong to League A in both the 1986 (League) and 1987 (NewLeague) seasons, although the distribution is fairly balanced with League N. Similarly, the Division variable is almost perfectly split between East (E) and West (W) divisions.
num_duplicates <- sum(duplicated(Hitters))
cat("Number of duplicate rows:", num_duplicates, "\n")
## Number of duplicate rows: 0
cat("Missing values before removal:\n\n")
## Missing values before removal:
missing_before <- colSums(is.na(Hitters))
print(missing_before[missing_before > 0])
## Salary
## 58
Hitters <- na.omit(Hitters)
cat("\n Missing values after removal:\n\n")
##
## Missing values after removal:
missing_after <- colSums(is.na(Hitters))
print(missing_after[missing_after > 0])
## named numeric(0)
cat("\n New dataset dimensions:\n")
##
## New dataset dimensions:
print(dim(Hitters))
## [1] 259 20
Interpretation
The dataset contained 58 missing values, all in the Salary column. Since our analysis relies heavily on salary as the response variable, it was necessary to remove these incomplete entries. After applying na.omit(), the number of observations dropped from 317 to 259. This ensures that all remaining rows contain complete information, which is crucial for reliable modeling, especially when fitting a multinomial logistic regression where missing data could skew the predictions or reduce accuracy.
This code converts the three categorical variables
(League,
Division, and
NewLeague) from character type to numeric
codes using as.factor() followed by
as.numeric(). This is necessary because
most machine learning models in R require numeric input, not text or
factor labels.
Hitters$League <- as.numeric(as.factor(Hitters$League))
Hitters$Division <- as.numeric(as.factor(Hitters$Division))
Hitters$NewLeague <- as.numeric(as.factor(Hitters$NewLeague))
str(Hitters[, c("League", "Division", "NewLeague")])
## 'data.frame': 259 obs. of 3 variables:
## $ League : num 2 1 2 2 1 2 1 2 1 2 ...
## $ Division : num 2 2 1 1 2 1 2 2 1 2 ...
## $ NewLeague: num 2 1 2 2 1 1 1 2 1 2 ...
unique(Hitters$League)
## [1] 2 1
unique(Hitters$Division)
## [1] 2 1
unique(Hitters$NewLeague)
## [1] 2 1
Interpretation
After encoding, each category is now represented by either 1 or 2. For example, League A and N are now 1 and 2, respectively. This transformation ensures that the categorical variables can be used properly in the regression model without causing errors.
This chunk identifies skewed numerical variables using the
skewness() function and applies a log
transformation to reduce the skewness.
A threshold of absolute skewness > 1 was used to define which variables needed transformation.
We visualize the distributions before and after the transformation for comparison. Then, the code overwrites the original variables with their log-transformed versions and summarizes the skewness reduction in a comparison table.
This step is important to improve model performance and meet assumptions of algorithms that expect more normally distributed input.
numeric_vars <- Hitters[, sapply(Hitters, is.numeric)]
skew_vals <- sapply(numeric_vars, function(x) skewness(x, na.rm = TRUE))
skewed_vars <- names(skew_vals[abs(skew_vals) > 1])
par(mfrow = c(2,2))
for (var in head(skewed_vars, 4)) {
hist(Hitters[[var]], main = paste("Original:", var), col = "#c268b7", xlab = var)
hist(log(Hitters[[var]] + 1), main = paste("Log-Transformed:", var), col = "#76226b", xlab = paste("log(", var, ")"))
}
par(mfrow = c(1,1))
for (var in skewed_vars){
if (all(Hitters[[var]] >= 0)) {
Hitters[[var]] <- log(Hitters[[var]] + 1)
}
}
post_skew <- sapply(Hitters[skewed_vars], function(x) skewness(x, na.rm = TRUE))
Skew_Before <- round(skew_vals[skewed_vars], 2)
Skew_After <- round(post_skew, 2)
skew_table <- data.frame(
Variable = skewed_vars,
Skew_Before = Skew_Before,
Skew_After = Skew_After
)
knitr::kable(skew_table, caption = "Skewness Before and After Log Transformation")
| Variable | Skew_Before | Skew_After |
|---|---|---|
| CAtBat | 1.32 | -0.68 |
| CHits | 1.46 | -0.65 |
| CHmRun | 2.23 | -0.44 |
| CRuns | 1.53 | -0.62 |
| CRBI | 1.55 | -0.49 |
| CWalks | 1.90 | -0.51 |
| PutOuts | 2.04 | -2.22 |
| Assists | 1.14 | -0.29 |
| Salary | 1.60 | -0.16 |
Interpretation
The histograms show that variables like CAtBat, CHits, and CRuns were highly right-skewed initially, but their distributions became much more symmetric after log transformation.
The summary table largely confirms this improvement: absolute skewness decreased markedly for most variables and even flipped sign in some cases. PutOuts is the exception, where the log transform overshot and left the distribution strongly left-skewed (-2.22).
Reducing skewness like this can help stabilize variance and improve the reliability of our regression models, especially when using techniques sensitive to the distribution of input variables.
This code uses ggplot2 to create
faceted boxplots for all numeric variables in the Hitters dataset.
It reshapes the data into long format and displays each variable in
its own panel using facet_wrap().
Outliers are highlighted in blue to make them visually stand out. This is helpful for quickly identifying extreme values across all features in one compact view.
Hitters_long <- Hitters %>%
select_if(is.numeric) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
ggplot(Hitters_long, aes(x = "", y = Value, fill = Variable)) +
geom_boxplot(outlier.shape = 21, outlier.color = "blue", outlier.fill = "white") +
facet_wrap(~ Variable, scales = "free", ncol = 4) +
labs(title = "Boxplots of Variables (Faceted)",
x = "", y = "Value") +
theme_minimal(base_size = 12) +
theme(
strip.text = element_text(face = "bold", size = 10),
plot.title = element_text(hjust = 0.5, face = "bold"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank()
)
Interpretation
The boxplots reveal that several variables have outliers, with some showing strong asymmetry or long tails.
This is important in our project since outliers can distort model training and affect the accuracy of predictions, especially in regression models.
Here we detect and remove outliers from the numeric variables in the Hitters dataset using the IQR (interquartile range) method.
A custom function identifies row indices where values fall outside the range [Q1 - 1.5 * IQR, Q3 + 1.5 * IQR], which is a common threshold for defining outliers.
This function is applied across all numeric columns, and all corresponding rows with outliers are collected and removed. The final output prints how many rows were removed and updates the dataset to reflect its new size.
get_outlier_indices <- function(x) {
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
which(x < (Q1 - 1.5 * IQR) | x > (Q3 + 1.5 * IQR))
}
numeric_cols <- sapply(Hitters, is.numeric)
Hitters_numeric <- Hitters[, numeric_cols]
outlier_indices <- unique(unlist(lapply(Hitters_numeric, get_outlier_indices)))
cat("Total outliers removed:", length(outlier_indices), "\n")
## Total outliers removed: 21
Hitters <- Hitters[-outlier_indices, ]
cat("New dataset size:", dim(Hitters)[1], "rows and", dim(Hitters)[2], "columns\n")
## New dataset size: 238 rows and 20 columns
Hitters_long_clean <- Hitters %>%
select_if(is.numeric) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
ggplot(Hitters_long_clean, aes(x = "", y = Value, fill = Variable)) +
geom_boxplot(outlier.shape = 21, outlier.color = "blue", outlier.fill = "white") +
facet_wrap(~ Variable, scales = "free", ncol = 4) +
labs(title = "Boxplots of Variables After Outlier Removal (Faceted)",
x = "", y = "Value") +
theme_minimal(base_size = 12) +
theme(
strip.text = element_text(face = "bold", size = 10),
plot.title = element_text(hjust = 0.5, face = "bold"),
axis.text.x = element_blank(),
axis.ticks.x = element_blank()
)
Interpretation
After applying the IQR method to remove outliers, the boxplots of
the Hitters dataset appear much cleaner and more symmetric. Most
variables–like CAtBat,
CHits,
CRuns, and
Salary–show tighter interquartile ranges
and fewer extreme points, which improves the reliability of the modeling
process. While some mild outliers still exist, the overall spread of the
data is now more balanced.
This step is crucial in our project because extreme values could distort the predictions of the multinomial logistic regression model we’re using to classify players into salary levels. With the reduced influence of outliers, the model will now generalize better to unseen data and produce more stable and accurate results.
In this step, we scale all numeric predictors in the dataset using standardization (z-score scaling), which centers each variable at a mean of 0 with a standard deviation of 1.
We exclude categorical variables and the target variable
Salary from this process, as they should
not be scaled. This scaling ensures that all features contribute equally
to distance-based models and avoids dominance by variables with larger
numeric ranges.
numeric_cols <- sapply(Hitters, is.numeric)
exclude_cols <- c("League", "Division", "NewLeague", "Salary")
numeric_cols[names(numeric_cols) %in% exclude_cols] <- FALSE
Hitters_scaled <- Hitters
Hitters_scaled[, numeric_cols] <- scale(Hitters[, numeric_cols])
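The wide summary table that follows is presumably the rendered output of something like knitr::kable(summary(Hitters_scaled)). A hedged, self-contained sketch on toy data standing in for Hitters_scaled:

```r
# Hedged sketch: summarizing a scaled data frame. scale() centers each
# column to mean 0 and rescales it to standard deviation 1.
toy <- data.frame(AtBat = c(293, 315, 479, 496),
                  Hits  = c(66, 81, 130, 141))
toy_scaled <- as.data.frame(scale(toy))

knitr::kable(summary(toy_scaled))  # wide summary table, as rendered below
round(colMeans(toy_scaled), 10)    # all numerically zero
sapply(toy_scaled, sd)             # all exactly 1
```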
| AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | League | Division | PutOuts | Assists | Errors | Salary | NewLeague | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. :-1.94928 | Min. :-1.84561 | Min. :-1.3445 | Min. :-1.8892 | Min. :-1.7057 | Min. :-1.8198 | Min. :-1.3575 | Min. :-2.3617 | Min. :-2.3277 | Min. :-2.39113 | Min. :-2.56997 | Min. :-2.7486 | Min. :-2.7933 | Min. :1.000 | Min. :1.0 | Min. :-2.54420 | Min. :-2.24069 | Min. :-1.4023 | Min. :4.227 | Min. :1.000 | |
| 1st Qu.:-0.86658 | 1st Qu.:-0.81352 | 1st Qu.:-0.7565 | 1st Qu.:-0.8564 | 1st Qu.:-0.8330 | 1st Qu.:-0.8166 | 1st Qu.:-0.6881 | 1st Qu.:-0.6702 | 1st Qu.:-0.7236 | 1st Qu.:-0.65527 | 1st Qu.:-0.71556 | 1st Qu.:-0.6585 | 1st Qu.:-0.6853 | 1st Qu.:1.000 | 1st Qu.:1.0 | 1st Qu.:-0.63300 | 1st Qu.:-0.91467 | 1st Qu.:-0.7672 | 1st Qu.:5.254 | 1st Qu.:1.000 | |
| Median : 0.05062 | Median :-0.03807 | Median :-0.2861 | Median :-0.1016 | Median :-0.1956 | Median :-0.1955 | Median :-0.2419 | Median : 0.1086 | Median : 0.1054 | Median : 0.09905 | Median : 0.09282 | Median : 0.1291 | Median : 0.1066 | Median :1.000 | Median :1.5 | Median : 0.03888 | Median : 0.07404 | Median :-0.2909 | Median :6.039 | Median :1.000 | |
| Mean : 0.00000 | Mean : 0.00000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.0000 | Mean : 0.00000 | Mean : 0.00000 | Mean : 0.0000 | Mean : 0.0000 | Mean :1.475 | Mean :1.5 | Mean : 0.00000 | Mean : 0.00000 | Mean : 0.0000 | Mean :5.924 | Mean :1.471 | |
| 3rd Qu.: 0.82817 | 3rd Qu.: 0.75413 | 3rd Qu.: 0.6547 | 3rd Qu.: 0.7623 | 3rd Qu.: 0.7947 | 3rd Qu.: 0.7480 | 3rd Qu.: 0.6506 | 3rd Qu.: 0.7958 | 3rd Qu.: 0.7527 | 3rd Qu.: 0.72896 | 3rd Qu.: 0.73143 | 3rd Qu.: 0.7180 | 3rd Qu.: 0.7016 | 3rd Qu.:2.000 | 3rd Qu.:2.0 | 3rd Qu.: 0.48004 | 3rd Qu.: 0.90812 | 3rd Qu.: 0.6618 | 3rd Qu.:6.621 | 3rd Qu.:2.000 | |
| Max. : 1.91949 | Max. : 2.86292 | Max. : 2.6539 | Max. : 2.9570 | Max. : 2.7264 | Max. : 3.0532 | Max. : 2.6588 | Max. : 1.6920 | Max. : 1.7349 | Max. : 1.92503 | Max. : 1.63284 | Max. : 1.8369 | Max. : 2.1348 | Max. :2.000 | Max. :2.0 | Max. : 2.27718 | Max. : 1.45696 | Max. : 2.7259 | Max. :7.808 | Max. :2.000 |
Interpretation
The summary table shows that all scaled numeric variables now have a mean close to 0 and comparable ranges, with values mostly between -2.5 and 2.5. This is essential for our multinomial logistic regression and the other models because many algorithms are sensitive to the scale of input features.
Without scaling, variables like
CAtBat or
CRuns could have overpowered other
predictors, leading to biased coefficients or inefficient splits in the
tree-based model. This step helps us build fair and interpretable models
for predicting player salary levels.
par(mfrow = c(1,2))
hist(Hitters$Years, breaks = 20, main = "Original Years", col = "#c268b7")
hist(Hitters_scaled$Years, breaks = 20, main = "Scaled Years", col = "#76226c")
par(mfrow = c(1,1))
Interpretation
The histogram comparison shows the variable
Years before and after scaling. While the
shape of the distribution remains the same (right-skewed), the scaled
version now has a mean of 0 and is measured in standard deviations.
This ensures that Years, like the
other predictors, contributes fairly during model training without being
influenced by its original scale or unit.
| AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | PutOuts | Assists | Errors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mean | 408.66 | 109.71 | 11.43 | 55.56 | 51.49 | 41.09 | 7.08 | 7.46 | 6.13 | 3.56 | 5.42 | 5.28 | 5.05 | 5.4 | 3.76 | 8.83 |
| SD | 145.01 | 44.81 | 8.50 | 25.17 | 25.50 | 20.93 | 4.48 | 0.96 | 1.00 | 1.20 | 1.01 | 1.05 | 1.02 | 0.8 | 1.68 | 6.30 |
| AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | PutOuts | Assists | Errors | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mean | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| SD | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
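The before/after mean-and-SD tables above can be reproduced with a small helper along these lines (shown here on toy columns rather than the actual Hitters data):

```r
# Hedged sketch: mean/SD comparison tables before and after scaling,
# on toy columns in place of the Hitters variables.
toy <- data.frame(RBI    = c(29, 38, 72, 51, 26),
                  CHmRun = c(1.0, 4.2, 3.6, 2.8, 4.5))
toy_scaled <- as.data.frame(scale(toy))

stats_table <- function(df) {
  rbind(Mean = round(colMeans(df), 2),
        SD   = round(sapply(df, sd), 2))
}
stats_table(toy)         # original scale: very different means and SDs
stats_table(toy_scaled)  # standardized: all means 0, all SDs 1
```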
Interpretation
The first table shows the mean and standard deviation of numeric variables before scaling, where we can see a wide range of values.
For example, RBI had a mean of
51.49 and a standard deviation of 25.50, while
CHmRun had a much smaller mean of 3.56.
This difference in scale could cause certain variables to dominate
during model training.
After applying standardization, as shown in the second table, all numeric variables now have a mean of 0 and a standard deviation of 1.
This uniform scale allows each feature to contribute equally to distance-based and regression models, such as the multinomial logistic regression used in this project to classify salary levels. It improves convergence and interpretability by putting all predictors on the same footing.
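The standardization used throughout is the z-score transform z = (x - mean(x)) / sd(x); a quick check on a toy vector that R's scale() does exactly this:

```r
# Verify that scale() (with its defaults center = TRUE, scale = TRUE)
# matches the manual z-score formula on a toy vector.
x <- c(28, 44, 63, 121, 5)
z_manual <- (x - mean(x)) / sd(x)
z_scale  <- as.numeric(scale(x))
all.equal(z_manual, z_scale)  # TRUE
```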
This code fits a multiple linear regression model with
Salary as the dependent variable and all
other predictors from the scaled dataset as independent variables.
It then extracts the p-values of each predictor from the model summary and displays them in ascending order. This helps us evaluate which variables are statistically significant in predicting salary, based on the common 0.05 threshold.
linear_model <- lm(Salary ~ ., data = Hitters_scaled)
summary_lm <- summary(linear_model)
p_values <- summary_lm$coefficients[, 4]
pval_df <- data.frame(
Variable = names(p_values),
P_value = round(p_values, 4)
) %>%
arrange(P_value)
knitr::kable(pval_df, caption = "P-values for Predictors of Salary (Full Model, No Splitting)")
| Variable | P_value |
|---|---|
| (Intercept) | 0.0000 |
| Years | 0.0000 |
| Walks | 0.0207 |
| PutOuts | 0.0412 |
| AtBat | 0.0527 |
| Division | 0.0771 |
| CHits | 0.1507 |
| Hits | 0.2279 |
| Errors | 0.3639 |
| RBI | 0.4259 |
| Runs | 0.5654 |
| CHmRun | 0.6670 |
| CWalks | 0.6889 |
| CAtBat | 0.7182 |
| NewLeague | 0.7468 |
| CRBI | 0.7643 |
| CRuns | 0.8234 |
| HmRun | 0.8388 |
| League | 0.9500 |
| Assists | 0.9858 |
Interpretation
From the results, Years,
Walks, and
PutOuts have p-values below 0.05, meaning
they are statistically significant predictors of salary in our dataset.
This makes sense in the context of our project, as more years of
experience and better performance in fielding and on-base skills (like
walks) are likely to influence player salaries.
On the other hand, variables like
CHmRun,
CRuns, and
RBI show higher p-values, indicating
weaker or non-significant relationships with salary when controlling for
other variables. This insight can guide us in refining our feature set
or focusing our model on the most relevant predictors.
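One way to act on this is to refit a reduced model keeping only the significant predictors (Years, Walks, PutOuts). A hedged sketch on simulated data standing in for Hitters_scaled:

```r
# Hedged sketch: reduced linear model with only the significant predictors.
# Simulated data stands in for the real Hitters_scaled; coefficients are
# arbitrary illustration values, not estimates from the project.
set.seed(7)
n <- 100
sim <- data.frame(Years   = rnorm(n),
                  Walks   = rnorm(n),
                  PutOuts = rnorm(n))
sim$Salary <- 0.8 * sim$Years + 0.3 * sim$Walks +
              0.2 * sim$PutOuts + rnorm(n, sd = 0.5)

reduced_model <- lm(Salary ~ Years + Walks + PutOuts, data = sim)
summary(reduced_model)$coefficients[, c("Estimate", "Pr(>|t|)")]
```

Comparing this reduced fit to the full model (e.g. via anova() or adjusted R-squared) is the usual next step when pruning predictors.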
numeric_vars <- Hitters_scaled %>% select_if(is.numeric)
cor_with_salary <- cor(numeric_vars, use = "complete.obs")[, "Salary"]
cor_df <- tibble(
Variable = names(cor_with_salary),
Correlation = cor_with_salary
) %>%
filter(Variable != "Salary") %>%
arrange(desc(Correlation))
ggplot(cor_df, aes(x = "Salary", y = reorder(Variable, Correlation), fill = Correlation)) +
geom_tile(color = "white", linewidth = 0.7) +
geom_text(aes(label = round(Correlation, 2)), size = 4, color = "black") +
scale_fill_gradient2(
low = "#67a9cf",
mid = "white",
high = "#d01c8b",
midpoint = 0,
limits = c(-1, 1),
name = "Correlation"
) +
labs(
title = "Heatmap of Correlation with Salary",
x = "",
y = "Variables"
) +
theme_minimal(base_size = 13) +
theme(
axis.text.x = element_blank(),
axis.ticks.x = element_blank(),
panel.grid = element_blank(),
plot.title = element_text(face = "bold", hjust = 0.5)
)
Interpretation
This heatmap shows the strength of the linear correlation between
each variable and the player’s salary. We can see that variables like
CHits,
CRuns, CRBI,
CAtBat, and
CWalks have the highest positive
correlations (above 0.79), suggesting that career performance stats are
strong indicators of a player’s salary.
In contrast, variables like League,
NewLeague, and
Division have very weak or even slightly
negative correlations, meaning they are unlikely to have a meaningful
linear relationship with salary. This aligns with the context of our
project–players are likely paid more for consistent performance over
their careers rather than their current league or division.
In this code chunk, we calculate and visualize the full correlation
matrix for all variables in the
Hitters_scaled dataset. The first step
converts any factor variables into numeric form using
lapply, then computes pairwise
correlations using cor() with
complete.obs to handle any remaining NA
values.
We then define a custom color palette using three shades to visually
represent the strength and direction of the correlations. Finally, we
use ggcorrplot() to generate a heatmap of
the matrix, showing only the upper triangle for a cleaner visual
layout.
Hitters_num <- as.data.frame((lapply(Hitters_scaled, function(x) {
if (is.factor(x)) as.numeric(x) else x
})))
# Compute correlation matrix
cor_matrix <- cor(Hitters_num, use = "complete.obs")
custom_colors <- c("#c268b7", "white", "#76226c")
ggcorrplot(cor_matrix,
method = "square",
type = "upper",
lab = FALSE,
colors = custom_colors,
title = "Correlation Matrix of All Variables") +
theme(
plot.title = element_text(hjust = 0.5, size = 14, face = "bold", margin = ggplot2::margin(t = 5)),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1, size = 8),
axis.text.y = element_text(size = 8, margin = ggplot2::margin(r = 5))
)
Interpretation
The correlation matrix heatmap gives us a comprehensive view of
how each variable relates to the others in our dataset. Most notably, we
can observe strong positive correlations among cumulative performance
metrics like CHits,
CRuns, CRBI,
and CAtBat, which makes sense as better
career performance is likely associated across various batting
statistics.
There’s also a noticeable cluster of high correlation between these
cumulative stats and Salary, reinforcing
that long-term performance plays a key role in salary
determination.
On the other hand, categorical variables like
League and
Division show very low correlation with
most numeric predictors and with salary, suggesting that league
affiliation may not be a key salary driver in this model. This matrix
helps us understand multicollinearity risks and guides variable
selection for modeling.
datatable(round(cor_matrix, 2),
caption = "Interactive Correlation Matrix",
options = list(pageLength = 10, scrollX = TRUE))
Interpretation
This interactive table presents the pairwise correlation
coefficients between the numeric variables in the Hitters dataset. The
high values between AtBat,
Hits, and
Runs suggest a very strong linear
relationship.
This makes sense contextually because players who have more
at-bats typically accumulate more hits and, consequently, more runs.
Similarly, RBI (Runs Batted In) shows high
correlation with both Hits (0.78) and
HmRun (0.84), reflecting how power hitters
contribute significantly to scoring.
Interestingly, Years has almost no
correlation with most of the other variables, indicating that tenure
alone doesn’t necessarily drive performance metrics. This table helps us
detect multicollinearity between predictors–critical when building
regression models–since variables like
Hits and
Runs might be too closely related to be
included together.
For modeling salary, this insight supports selecting variables that add unique information rather than those that are strongly correlated with each other.
This code calculates the Variance Inflation Factor (VIF) for each predictor in our linear regression model using the scaled Hitters dataset.
VIF measures how much the variance of a regression coefficient is inflated due to multicollinearity with other variables.
vif_model <- lm(Salary ~ ., data = Hitters_scaled)
vif_values <- vif(vif_model)
vif_values
## AtBat Hits HmRun Runs RBI Walks Years
## 27.957154 38.963208 9.175811 17.514468 13.280277 6.311489 6.332242
## CAtBat CHits CHmRun CRuns CRBI CWalks League
## 398.185151 512.857270 17.634045 112.645847 79.336048 31.066410 4.495162
## Division PutOuts Assists Errors NewLeague
## 1.058111 1.426682 2.357862 2.245484 4.486147
Interpretation
Looking at the results, several variables show high VIF values. This means these variables are highly correlated with others in the model, inflating their variance and potentially distorting the model’s interpretation and reliability.
This is expected given they’re all cumulative career stats, which often move together. To improve model stability and interpretability, it may be necessary to remove or combine some of these multicollinear predictors.
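A common way to do that pruning is stepwise elimination by VIF: repeatedly drop the predictor with the largest VIF until all values fall below a threshold (5 or 10 are typical cutoffs). A hedged sketch on simulated collinear data, computing VIF manually as 1 / (1 - R^2) rather than via car::vif():

```r
# Hedged sketch: stepwise VIF-based pruning on simulated data standing in
# for the Hitters predictors. VIF_j = 1 / (1 - R^2_j), where R^2_j comes
# from regressing predictor j on all the others.
set.seed(42)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)  # nearly collinear with x1
x3 <- rnorm(n)
dat <- data.frame(x1, x2, x3)

vif_manual <- function(df) {
  sapply(names(df), function(v) {
    r2 <- summary(lm(reformulate(setdiff(names(df), v), v), data = df))$r.squared
    1 / (1 - r2)
  })
}

preds <- names(dat)
repeat {
  v <- vif_manual(dat[preds])
  if (max(v) < 5 || length(preds) <= 2) break
  preds <- setdiff(preds, names(which.max(v)))  # drop the worst offender
}
preds  # one member of the collinear pair has been removed
```

On the real data, the same loop over the cumulative career statistics would likely retain only one or two of CAtBat, CHits, CRuns, and CRBI.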
This code performs Lasso Regression with cross-validation on the Hitters dataset to identify which predictors best explain salary.
First, it splits the features (x) and the response (y = Salary), then
performs Lasso using cv.glmnet() to find
the optimal value of lambda that minimizes the mean squared error (MSE).
The final model is built using that best lambda, and the non-zero
coefficients (selected variables) are extracted.
A plot is generated showing how MSE changes across different values of log(lambda), highlighting the optimal and 1SE rule lambdas.
x <- as.matrix(Hitters_scaled[, !colnames(Hitters_scaled) %in% "Salary"])
y <- Hitters_scaled$Salary
# Perform Lasso with cross-validation
set.seed(123)
lasso_cv <- cv.glmnet(x, y, alpha = 1, standardize = FALSE)
# Extract optimal lambda values
best_lambda <- lasso_cv$lambda.min
lambda_1se <- lasso_cv$lambda.1se
# Fit final model using best lambda
final_model <- glmnet(x, y, alpha = 1, lambda = best_lambda, standardize = FALSE)
coef_matrix <- coef(final_model)
coef_df <- as.data.frame(as.matrix(coef_matrix))
col_name <- colnames(coef_df)[1]
coef_df <- coef_df %>%
rownames_to_column("Variable") %>%
rename(Coefficient = all_of(col_name)) %>%
filter(Variable != "(Intercept)" & Coefficient != 0)
kable(coef_df, digits = 4, caption = "Variables Selected by Lasso (Non-zero Coefficients)")
| Variable | Coefficient |
|---|---|
| AtBat | -0.1644 |
| Hits | 0.1018 |
| Runs | -0.0239 |
| RBI | 0.0678 |
| Walks | 0.1185 |
| Years | -0.2974 |
| CHits | 0.9038 |
| CHmRun | 0.0396 |
| Division | -0.0929 |
| PutOuts | 0.0661 |
| Errors | -0.0338 |
| NewLeague | 0.0264 |
selected_vars <- coef_df$Variable
Hitters_lasso_final <- Hitters_scaled %>%
dplyr::select(all_of(selected_vars), Salary)
cv_plot_data <- data.frame(
lambda = lasso_cv$lambda,
mse = lasso_cv$cvm,
se = lasso_cv$cvsd
)
ggplot(cv_plot_data, aes(x = log(lambda), y = mse)) +
geom_line(color = "#91477f", linewidth = 1) +
geom_point(color = "#f2739f", size = 2) +
geom_errorbar(aes(ymin = mse - se, ymax = mse + se), width = 0.15, color = "gray60") +
geom_vline(xintercept = log(best_lambda), linetype = "dashed", color = "#2b303a", linewidth = 1) +
geom_vline(xintercept = log(lambda_1se), linetype = "dotted", color = "#2b303a", linewidth = 1) +
labs(
title = "Lasso Cross-Validation Curve",
subtitle = "Optimal Lambda (dashed) vs 1SE Rule (dotted)",
x = "Log(λ)",
y = "Mean Squared Error"
) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold", size = 16, color = "#4B0082"),
plot.subtitle = element_text(size = 12)
)
Interpretation
The Lasso Cross-Validation Curve shows that the best lambda occurs at the point where the Mean Squared Error is lowest, indicating the best balance between model complexity and prediction accuracy. The dotted line (1SE rule) represents a slightly more regularized model that’s simpler but may sacrifice a bit of predictive power for parsimony.
From the table of Variables Selected by Lasso,
we see that 12 predictors have non-zero coefficients, meaning Lasso
deemed them useful for predicting salary. Notably,
CHits had the strongest positive effect
(0.9038), which is logical–more hits usually mean better performance and
higher salary.
Years had a surprisingly negative
coefficient, possibly because longer-tenured players with declining
stats might earn less than newer stars.
AtBat also showed a negative relationship,
which might reflect that simply being at bat doesn’t necessarily mean
high performance unless it translates into outcomes like hits or
runs.
Overall, Lasso helped simplify the model by selecting only relevant variables, reducing multicollinearity, and making the salary prediction model more interpretable and robust.
In this part, a subset of the dataset
(Hitters_lasso_final) is selected,
focusing on four variables: Years,
CHmRun,
Walks and
Runs. These are likely variables
reflecting experience and performance in baseball.
The cor() function calculates pairwise
Pearson correlation coefficients between them using complete
observations only.
Then, the matrix is reshaped into a long format using
melt() from the
reshape2 package to prepare for
visualization.
A heatmap is generated using ggplot2,
where each tile represents the strength and direction of the
correlation, with a custom color gradient from red (negative) to blue
(positive), and numeric correlation values are overlaid directly on the
tiles for easier interpretation.
exp_perf_vars <- Hitters_lasso_final %>%
dplyr::select(Years, CHmRun, Walks, Runs)
cor_matrix <- cor(exp_perf_vars, use = "complete.obs") %>%
round(2)
cor_melted <- melt(cor_matrix)
ggplot(cor_melted, aes(Var1, Var2, fill = value)) +
geom_tile(color = "white", size = 0.5) +
geom_text(aes(label = value), color = "black", size = 4.5) +
scale_fill_gradient2(low = "#b2182b", mid = "white", high = "#2166ac",
midpoint = 0, limit = c(-1, 1), name = "Corr") +
labs(title = "Experience vs Performance: Correlation Matrix",
x = "", y = "") +
theme_minimal(base_size = 12) +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust = 1),
plot.title = element_text(face = "bold", size = 14))
Interpretation
This heatmap helps explore how a player’s experience
(Years) relates to their on-field
performance (CHmRun,
Walks, and
Runs). A strong positive correlation
(0.70) between Years and
CHmRun suggests that more experienced
players tend to contribute more to team home runs.
However, Years has almost no
correlation with Runs (-0.03) or
Walks (0.08), meaning simply playing
longer doesn’t necessarily increase a player’s scoring or plate
discipline.
Interestingly, Walks and
Runs are highly correlated (0.69), hinting
that players who draw more walks are more likely to score, which makes
sense tactically in baseball.
This matrix provides valuable insight into which aspects of experience contribute most to performance, which is central to understanding salary drivers in this dataset.
Here we calculate a custom Player Value Score as a proxy for performance efficiency using the formula:
(Runs + Walks + CHmRun) / (Years + 1)
This score reflects how productive a player is relative to their experience. A cap is applied at the 99th percentile to avoid extreme outliers skewing the plot.
Then, it selects the top 5 most efficient players based on this metric and visualizes two plots:
Player Value vs Salary: A scatter plot with a LOESS smooth line showing general trends.
Top 5 Most Efficient Players: Highlights the top scorers and labels them, helping compare performance to salary visually.
library(dplyr)
library(ggrepel)
Hitters_lasso_final <- Hitters_lasso_final %>%
mutate(Player_Value_Score = (Runs + Walks + CHmRun) / (Years + 1))
Hitters_lasso_final$Player_Value_Score <- pmin(
Hitters_lasso_final$Player_Value_Score,
quantile(Hitters_lasso_final$Player_Value_Score, 0.99)
)
top5 <- Hitters_lasso_final %>%
top_n(5, wt = Player_Value_Score) %>%
arrange(desc(Player_Value_Score)) %>%
mutate(Rank_Label = paste0("Rank ", row_number()))
p1 <- ggplot(Hitters_lasso_final, aes(x = Salary, y = Player_Value_Score)) +
geom_point(aes(color = Player_Value_Score), size = 2.5, alpha = 0.75) +
geom_smooth(method = "loess", se = FALSE, color = "darkblue", linetype = "solid") +
scale_color_viridis_c(option = "C", direction = -1, name = "Value Score") +
labs(
title = "Player Value vs. Salary",
subtitle = "Does higher salary reflect higher performance?",
x = "Salary (Scaled)",
y = "Player Value Score"
) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold", size = 16),
plot.subtitle = element_text(size = 12),
plot.margin = ggplot2::margin(t = 15, r = 20, b = 10, l = 20, unit = "pt")
)
p2 <- ggplot(Hitters_lasso_final, aes(x = Salary, y = Player_Value_Score)) +
geom_point(color = "gray85", size = 2, alpha = 0.5) +
geom_point(data = top5, aes(color = Player_Value_Score), size = 5, alpha = 0.95) +
geom_text_repel(data = top5, aes(label = Rank_Label),
fontface = "bold", size = 4, box.padding = 0.5,
color = "black", segment.color = "gray40") +
scale_color_viridis_c(option = "D", direction = 1, name = "Value Score") +
labs(
title = "Top 5 Most Efficient Players",
subtitle = "Ranked by Player Value Score vs Salary",
x = "Salary (Scaled)",
y = "Player Value Score"
) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold", size = 16),
plot.subtitle = element_text(size = 12),
plot.margin = ggplot2::margin(t = 15, r = 20, b = 10, l = 20, unit = "pt")
)
p1 / p2 + plot_layout(heights = c(1.1, 1))
Interpretation
Plot 1: Player Value vs Salary
In this plot, each dot represents a player, colored by their
Player_Value_Score. The horizontal axis
shows the scaled salary, while the vertical axis shows the value
score.
While we expect that higher salaries should reflect higher performance, the LOESS curve tells a different story. The relationship is weak and non-linear, suggesting that salary alone does not reliably predict how efficiently a player performs.
Interestingly, players with mid-range salaries sometimes outperform those with higher ones; some of the top value scores belong to players who are not the highest earners.
Plot 2: Top 5 Most Efficient Players
This chart isolates and highlights the top 5 most efficient players, labeling them by rank. What stands out is that none of these players are the top earners. This reveals a disconnect between salary and actual on-field value, and highlights players who are likely underpaid relative to their contribution.
These insights could be useful for team managers aiming to spot hidden gems or negotiate fairer contracts.
Here we generate interactive scatterplots using
plotly to visualize how three selected
performance variables–Years,
AtBat, and
Errors–relate to players’ salaries.
The lapply() function loops over each
variable, and for each one, a scatterplot is created with players’
Salary on the y-axis and the selected variable on the x-axis.
Tooltips are customized for better interactivity. Finally,
subplot() is used to arrange the three
plots in a two-row grid with shared Y-axes and a centralized title using
layout().
features <- c("Years", "AtBat", "Errors")
styled_plots <- lapply(features, function(var) {
plot_ly(
data = Hitters_lasso_final,
x = ~get(var),
y = ~Salary,
type = 'scatter',
mode = 'markers',
marker = list(color = '#914777', opacity = 0.75, size = 6),
hovertemplate = paste0(
"<b>", var, "</b>: %{x:.2f}<br>",
"<b>Salary</b>: %{y:.2f}<extra></extra>"
)
) %>% layout(
xaxis = list(title = var, tickfont = list(size = 10), titlefont = list(size = 11)),
yaxis = list(title = "Salary", tickfont = list(size = 10), titlefont = list(size = 11)),
margin = list(l = 40, r = 10, b = 40, t = 50), # increased top margin for the title
showlegend = FALSE,
plot_bgcolor = '#fafafa',
paper_bgcolor = '#fafafa'
)
})
subplot(styled_plots, nrows = 2, shareY = TRUE, titleX = TRUE, titleY = TRUE, margin = 0.05) %>%
layout(
title = list(
text = "Interactive Scatterplots: Salary vs Performance Metrics",
x = 0.5, font = list(size = 16, family = "Arial", color = "black")
)
)
Interpretation
From the interactive scatterplots:
Years vs Salary: There’s a positive relationship, showing that players with more years of experience generally earn higher salaries. This makes intuitive sense as experience is often rewarded.
AtBat vs Salary: A mild upward trend appears, suggesting players who get to bat more frequently may earn more, likely due to their active presence and potential offensive contribution.
Errors vs Salary: There is no clear trend, implying that the number of fielding errors made by a player doesn’t significantly affect salary–possibly because other metrics weigh more in contract negotiations.
Here we visualize the salary distribution of baseball players from the Hitters dataset, split by Division (East vs West).
The Division column is first re-labeled
with factor labels “East” and “West”, and custom colors are defined for
each group.
The plot uses geom_density() to draw
smooth distribution curves for salary. Styling is applied to make the
plot visually appealing with customized legends, titles, and
minimalistic theme settings.
Hitters_lasso_final$Division <- Hitters$Division
Hitters_lasso_final$Division <- factor(Hitters_lasso_final$Division, levels = c(1, 2), labels = c("East", "West"))
division_colors <- c("East" = "#914777", "West" = "#c268b7")
ggplot(Hitters_lasso_final, aes(x = Salary, fill = Division)) +
geom_density(alpha = 0.6, color = NA) +
scale_fill_manual(values = division_colors) +
labs(
title = "Salary Distribution by Division",
subtitle = "Comparing Scaled Salary Levels between East and West Divisions",
x = "Scaled Salary",
y = "Density",
fill = "Division"
) +
theme_minimal(base_size = 13) +
theme(
plot.title = element_text(face = "bold", size = 16, hjust = 0.5),
plot.subtitle = element_text(size = 12, hjust = 0.5),
axis.title = element_text(size = 12),
axis.text = element_text(size = 10),
legend.position = "top",
legend.title = element_text(face = "bold"),
legend.text = element_text(size = 10),
plot.margin = unit(c(12, 20, 10, 20), "pt")
)
Interpretation
The resulting density plot shows a comparison of scaled salary distributions between players in the East and West divisions.
Both distributions are unimodal but differ in shape: while salary levels are generally comparable across divisions, salary dispersion is wider in the West, potentially due to more outliers or a broader mix of player experience and performance levels.
Here, we remove the variable that we created to maintain the original predictors.
## [1] "AtBat" "Hits" "Runs" "RBI" "Walks" "Years"
## [7] "CHits" "CHmRun" "Division" "PutOuts" "Errors" "NewLeague"
## [13] "Salary"
In this chunk, we’re using the Elbow Method to decide how many clusters we should use to segment the players’ salaries.
We first remove NA values from the
Salary column and compute the total
within-cluster sum of squares (WSS) for clusters from 1 to 10 using
kmeans(). Then we calculate how much WSS
drops when moving from one cluster to the next, both in absolute and
percentage terms.
salary_vector <- na.omit(Hitters_final$Salary)
set.seed(123)
wss <- sapply(1:10, function(k) {
kmeans(salary_vector, centers = k, nstart = 20)$tot.withinss
})
wss_diff <- c(NA, diff(wss))
wss_pct_drop <- c(NA, round(100 * diff(wss) / head(wss, -1), 2))
elbow_table <- data.frame(
Clusters = 1:10,
WSS = round(wss, 0),
WSS_Drop = round(wss_diff, 0),
Percent_Drop = wss_pct_drop
)
print(elbow_table)
## Clusters WSS WSS_Drop Percent_Drop
## 1 1 183 NA NA
## 2 2 51 -133 -72.42
## 3 3 26 -24 -48.29
## 4 4 15 -11 -43.35
## 5 5 9 -6 -40.41
## 6 6 6 -3 -32.03
## 7 7 5 -1 -22.77
## 8 8 4 -1 -21.82
## 9 9 3 -1 -17.75
## 10 10 2 -1 -21.06
Interpretation
The results show a sharp WSS drop from 1 to 2 clusters (−72.42%), then gradually smaller improvements as k increases.
While 2 clusters give the biggest drop, the “elbow” is most visible at 4 clusters, after which the rate of WSS reduction slows down significantly.
This suggests that 4 is the statistically optimal number of clusters, balancing model complexity with clustering efficiency. It also aligns with our goal of categorizing players into four salary levels, allowing us to move forward with data-driven and interpretable segmentation.
Here we create an elbow plot with numerical WSS labels at each point.
It helps visualize how the total within-cluster sum of squares (WSS)
changes with different numbers of clusters (K). The
plot() function draws the elbow plot, and
text() is used to annotate each point with
its corresponding WSS value.
plot(1:10, wss, type = "b", pch = 19,
xlab = "Number of Clusters K",
ylab = "Total Within-Cluster Sum of Squares",
main = "Elbow Plot with WSS Labels")
text(1:10, wss, labels = round(wss, 0), pos = 3, cex = 0.8)
Interpretation
The elbow plot clearly shows a sharp drop in WSS when increasing clusters from 1 to 2 (from 183 to 51), and again a noticeable drop up to K = 4. After K = 4, the decrease in WSS becomes much smaller, indicating diminishing returns.
While the biggest drop occurs at 2 clusters, we decided to go with 4 clusters as the optimal choice. This is because although K = 2 reduces WSS drastically, using 4 clusters gives us a more refined segmentation of salary levels without overcomplicating the model.
It balances efficiency and interpretability, which is ideal for categorizing salaries into meaningful salary levels for further analysis like classification.
In this chunk, we are creating a new categorical variable called
SalaryLevel by dividing the continuous
Salary variable into four levels based on
quartiles. This is done using the cut()
function, which splits the salary into 4 bins based on its quantile
distribution.
We use include.lowest = TRUE to make
sure the lowest salary value is included, and
na.rm = TRUE to handle any missing data.
Then, a frequency table and a barplot are created to show how many
players fall into each salary level.
Hitters_final$SalaryLevel <- cut(
Hitters_final$Salary,
breaks = quantile(Hitters_final$Salary, probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE),
labels = c("Rookie", "Starter", "Pro", "MVP"),
include.lowest = TRUE
)
table(Hitters_final$SalaryLevel)
##
## Rookie Starter Pro MVP
## 60 59 65 54
barplot(table(Hitters_final$SalaryLevel), col = "darkblue",
main = "Number of Players per Salary Level",
ylab = "Count")
Interpretation
The barplot shows a relatively even distribution of players across the four salary levels: Rookie (60), Starter (59), Pro (65), and MVP (54). This confirms that our salary levels are balanced due to the use of quartiles.
This step is essential because we later use these salary levels as target categories in our classification models.
By creating well-distributed classes, we improve the performance and fairness of the models, preventing bias toward any single salary group. It also helps in interpretability.
Now we can talk about player predictions in terms of intuitive categories rather than raw salary numbers.
Here we remove the Salary attribute to prevent data leakage later on.
names(Hitters_final)
## [1] "AtBat" "Hits" "Runs" "RBI" "Walks"
## [6] "Years" "CHits" "CHmRun" "Division" "PutOuts"
## [11] "Errors" "NewLeague" "Salary" "SalaryLevel"
model_data <- subset(Hitters_final, select = -Salary)
names(model_data)
## [1] "AtBat" "Hits" "Runs" "RBI" "Walks"
## [6] "Years" "CHits" "CHmRun" "Division" "PutOuts"
## [11] "Errors" "NewLeague" "SalaryLevel"
Here we create a 2x2 grid of boxplots to visually explore how four
key performance indicators — Years,
Hits, RBI,
and Walks — vary across the defined
SalaryLevel categories (“Rookie”,
“Starter”, “Pro”, “MVP”).
Each boxplot shows the distribution of a specific variable across the four salary levels using different colors to help differentiate the plots.
par(mfrow = c(2, 2))
boxplot(model_data$Years ~ model_data$SalaryLevel,
main = "Years vs Salary Level", col = "lightblue", ylab = "Years")
boxplot(model_data$Hits ~ model_data$SalaryLevel,
main = "Hits vs Salary Level", col = "lightgreen", ylab = "Hits")
boxplot(model_data$RBI ~ model_data$SalaryLevel,
main = "RBI vs Salary Level", col = "lightcoral", ylab = "RBI")
boxplot(model_data$Walks ~ model_data$SalaryLevel,
main = "Walks vs Salary Level", col = "plum", ylab = "Walks")
par(mfrow = c(1, 1))
Interpretation
These boxplots help us understand how some important performance
stats differ across salary levels in the Hitters dataset. As expected,
players with higher salaries tend to have higher values in
Years, Hits,
RBI, and
Walks.
Years: There’s a clear upward trend — MVPs have played the longest, while Rookies are newer to the league.
Hits and RBI: Both show higher medians as we go from Rookie to MVP, which makes sense since performance likely impacts salary negotiations.
Walks: Again, MVPs walk more on average, which could be due to better plate discipline or reputation.
Overall, the boxplots confirm that players with stronger stats
generally fall into the higher salary levels, which supports the logic
behind the SalaryLevel grouping we created
earlier.
This chunk first summarizes all variables in the
model_data dataframe. Then, using
dplyr and
tidyr, it selects only the numeric
variables and reshapes them into a long format so each value can be
associated with its variable name.
The final ggplot uses
geom_boxplot() to visualize the
distribution of all numeric variables, showing medians, interquartile
ranges, and outliers across the entire dataset.
summary(model_data)
## AtBat Hits Runs RBI
## Min. :-1.94928 Min. :-1.84561 Min. :-1.8892 Min. :-1.7057
## 1st Qu.:-0.86658 1st Qu.:-0.81352 1st Qu.:-0.8564 1st Qu.:-0.8330
## Median : 0.05062 Median :-0.03807 Median :-0.1016 Median :-0.1956
## Mean : 0.00000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.82817 3rd Qu.: 0.75413 3rd Qu.: 0.7623 3rd Qu.: 0.7947
## Max. : 1.91949 Max. : 2.86292 Max. : 2.9570 Max. : 2.7264
## Walks Years CHits CHmRun
## Min. :-1.8198 Min. :-1.3575 Min. :-2.3277 Min. :-2.39113
## 1st Qu.:-0.8166 1st Qu.:-0.6881 1st Qu.:-0.7236 1st Qu.:-0.65527
## Median :-0.1955 Median :-0.2419 Median : 0.1054 Median : 0.09905
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.7480 3rd Qu.: 0.6506 3rd Qu.: 0.7527 3rd Qu.: 0.72896
## Max. : 3.0532 Max. : 2.6588 Max. : 1.7349 Max. : 1.92503
## Division PutOuts Errors NewLeague SalaryLevel
## East:119 Min. :-2.54420 Min. :-1.4023 Min. :1.000 Rookie :60
## West:119 1st Qu.:-0.63300 1st Qu.:-0.7672 1st Qu.:1.000 Starter:59
## Median : 0.03888 Median :-0.2909 Median :1.000 Pro :65
## Mean : 0.00000 Mean : 0.0000 Mean :1.471 MVP :54
## 3rd Qu.: 0.48004 3rd Qu.: 0.6618 3rd Qu.:2.000
## Max. : 2.27718 Max. : 2.7259 Max. :2.000
library(ggplot2)
library(dplyr)
library(tidyr)
numeric_vars <- model_data %>%
select_if(is.numeric) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
ggplot(numeric_vars, aes(x = Variable, y = Value)) +
geom_boxplot(fill = "skyblue") +
labs(title = "Boxplots of Numeric Variables in model_data",
x = "Variable", y = "Value") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Interpretation
The boxplot helps us quickly scan for outliers, skewness, and the spread of each numeric feature. Most of the variables appear roughly centered around zero due to previous scaling, with balanced spreads. However, PutOuts and NewLeague stand out:
PutOuts has several extreme low outliers, meaning some players performed very poorly in this metric.
NewLeague seems more binary and right-skewed since it’s derived from a categorical variable turned numeric.
In this chunk, we’re running a multinomial logistic
regression using the multinom()
function from the nnet package to model
the relationship between several predictors and the
SalaryLevel outcome variable (with
categories: Rookie, Starter, Pro, MVP).
We compare each level against the baseline (“Rookie”) and extract
coefficients, standard errors, z-values, and p-values for
interpretation. Then, we combine all of this into a clean table using
kable().
library(nnet)
library(knitr)
library(dplyr)
full_model <- multinom(SalaryLevel ~ ., data = model_data, trace = FALSE)
summary_full <- summary(full_model)
coefs <- summary_full$coefficients
std_errs <- summary_full$standard.errors
z_vals <- coefs / std_errs
p_vals <- 2 * (1 - pnorm(abs(z_vals)))
levels <- rownames(coefs)
coef_list <- lapply(1:length(levels), function(i) {
data.frame(
Predictor = colnames(coefs),
Estimate = round(coefs[i, ], 4),
StdError = round(std_errs[i, ], 4),
Z_value = round(z_vals[i, ], 3),
P_value = round(p_vals[i, ], 4),
Comparison = levels[i]
)
})
final_table <- bind_rows(coef_list)
kable(final_table, caption = "Multinomial Logistic Regression Results by Salary Level (vs. Rookie)")
| | Predictor | Estimate | StdError | Z_value | P_value | Comparison |
|---|---|---|---|---|---|---|
| (Intercept)…1 | (Intercept) | 3.3682 | 1.2859 | 2.619 | 0.0088 | Starter |
| AtBat…2 | AtBat | -2.9380 | 1.6222 | -1.811 | 0.0701 | Starter |
| Hits…3 | Hits | 0.5993 | 1.3777 | 0.435 | 0.6636 | Starter |
| Runs…4 | Runs | 0.7695 | 1.0247 | 0.751 | 0.4527 | Starter |
| RBI…5 | RBI | 0.8891 | 0.6926 | 1.284 | 0.1993 | Starter |
| Walks…6 | Walks | 0.9943 | 0.5920 | 1.679 | 0.0931 | Starter |
| Years…7 | Years | -0.0375 | 1.1667 | -0.032 | 0.9744 | Starter |
| CHits…8 | CHits | 4.6897 | 1.2830 | 3.655 | 0.0003 | Starter |
| CHmRun…9 | CHmRun | -0.4984 | 0.6871 | -0.725 | 0.4682 | Starter |
| DivisionWest…10 | DivisionWest | -0.4336 | 0.6200 | -0.699 | 0.4844 | Starter |
| PutOuts…11 | PutOuts | -0.1894 | 0.3724 | -0.509 | 0.6110 | Starter |
| Errors…12 | Errors | -0.4851 | 0.3974 | -1.221 | 0.2221 | Starter |
| NewLeague…13 | NewLeague | 0.0254 | 0.6489 | 0.039 | 0.9688 | Starter |
| (Intercept)…14 | (Intercept) | 2.5340 | 1.4356 | 1.765 | 0.0775 | Pro |
| AtBat…15 | AtBat | -4.1186 | 1.8447 | -2.233 | 0.0256 | Pro |
| Hits…16 | Hits | 2.0939 | 1.5976 | 1.311 | 0.1900 | Pro |
| Runs…17 | Runs | -0.0571 | 1.1567 | -0.049 | 0.9606 | Pro |
| RBI…18 | RBI | 0.8455 | 0.8637 | 0.979 | 0.3276 | Pro |
| Walks…19 | Walks | 2.1102 | 0.6948 | 3.037 | 0.0024 | Pro |
| Years…20 | Years | -1.6495 | 1.2962 | -1.273 | 0.2032 | Pro |
| CHits…21 | CHits | 8.5713 | 1.6280 | 5.265 | 0.0000 | Pro |
| CHmRun…22 | CHmRun | -0.2560 | 0.8501 | -0.301 | 0.7633 | Pro |
| DivisionWest…23 | DivisionWest | -0.5269 | 0.7358 | -0.716 | 0.4740 | Pro |
| PutOuts…24 | PutOuts | 0.2207 | 0.4342 | 0.508 | 0.6112 | Pro |
| Errors…25 | Errors | -0.1883 | 0.4589 | -0.410 | 0.6815 | Pro |
| NewLeague…26 | NewLeague | 0.4121 | 0.7578 | 0.544 | 0.5866 | Pro |
| (Intercept)…27 | (Intercept) | 0.0246 | 1.7120 | 0.014 | 0.9885 | MVP |
| AtBat…28 | AtBat | -3.1746 | 2.0124 | -1.578 | 0.1147 | MVP |
| Hits…29 | Hits | 1.0602 | 1.8492 | 0.573 | 0.5664 | MVP |
| Runs…30 | Runs | -0.1673 | 1.2860 | -0.130 | 0.8965 | MVP |
| RBI…31 | RBI | 1.3944 | 0.9999 | 1.395 | 0.1631 | MVP |
| Walks…32 | Walks | 2.3725 | 0.7396 | 3.208 | 0.0013 | MVP |
| Years…33 | Years | -3.5471 | 1.4875 | -2.385 | 0.0171 | MVP |
| CHits…34 | CHits | 13.2905 | 2.1352 | 6.224 | 0.0000 | MVP |
| CHmRun…35 | CHmRun | -0.0104 | 1.0599 | -0.010 | 0.9922 | MVP |
| DivisionWest…36 | DivisionWest | -0.8901 | 0.8496 | -1.048 | 0.2948 | MVP |
| PutOuts…37 | PutOuts | 0.6009 | 0.5010 | 1.199 | 0.2304 | MVP |
| Errors…38 | Errors | -0.3352 | 0.5033 | -0.666 | 0.5054 | MVP |
| NewLeague…39 | NewLeague | 0.0775 | 0.8766 | 0.088 | 0.9295 | MVP |
Interpretation
The regression output shows which variables significantly differentiate between salary levels.
For example, when comparing MVPs to Rookies, the predictors with low p-values like CHits (p < 0.001) and Walks (p = 0.0013) are statistically significant, meaning they are strong indicators of being in the MVP group instead of Rookie. The estimate for CHits is very large (13.29 on the log-odds scale), which suggests that cumulative hits have a big impact on predicting top-tier players.
Similarly, in the comparison between Pro and Rookie, Walks (p
= 0.0024) is also a strong predictor. This shows some consistency
across higher salary levels. On the other hand, many variables like
PutOuts,
DivisionWest, and
NewLeague have high p-values across all
comparisons, indicating they don’t play a significant role in predicting
salary level.
This model helps us understand which performance metrics are most associated with rising salary categories in baseball. It also confirms that certain intuitive metrics (like offensive stats) are more influential than others when it comes to determining pay level.
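Since multinomial coefficients are reported on the log scale, exponentiating them can make the magnitudes easier to read: exp(coefficient) is the multiplicative change in the probability ratio of a class versus the baseline. As a hedged illustration using the Walks estimate (2.3725) from the MVP-vs-Rookie comparison above:

```r
# exp(coefficient) = multiplicative change in P(MVP)/P(Rookie) per
# one-unit increase in Walks (here one standard deviation, since the
# predictors were scaled), holding the other predictors fixed.
walks_coef <- 2.3725
round(exp(walks_coef), 2)  # roughly 10.7x higher relative risk of MVP vs Rookie
```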
Cross-validation is a resampling method used to evaluate the performance of a model on unseen data.
Instead of splitting the dataset just once into training and testing sets, cross-validation divides the data into multiple folds, trains the model on some folds, and tests it on the remaining ones.
This process is repeated several times, and the performance metrics are averaged to provide a more robust estimate of model accuracy.
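The resampling loop described above can be sketched from scratch in base R (synthetic data and a simple linear model purely for illustration; in the project itself caret’s train() automates this):

```r
# Manual 5-fold cross-validation: split the rows into k folds, train on
# k-1 folds, test on the held-out fold, then average the test errors.
set.seed(1)
n  <- 100
df <- data.frame(x = rnorm(n))
df$y <- 2 * df$x + rnorm(n)

k     <- 5
folds <- sample(rep(1:k, length.out = n))  # random fold assignment per row
fold_mse <- sapply(1:k, function(i) {
  fit  <- lm(y ~ x, data = df[folds != i, ])        # train on the other folds
  pred <- predict(fit, newdata = df[folds == i, ])  # predict the held-out fold
  mean((df$y[folds == i] - pred)^2)                 # test MSE for this fold
})
mean(fold_mse)  # cross-validated estimate of test error
```

Leave-One-Out CV, used later in this report, is the special case where k equals the number of observations.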
In our project, we implemented cross-validation using the train() function from the caret package; for the models in this section, the resampling method turned out to be Leave-One-Out CV, as selected below.
This allowed us to fairly compare models by evaluating their predictive
power on different subsets of the data, ensuring that our results
weren’t biased by a specific train-test split.
We applied this method consistently across all models to maintain a reliable and consistent evaluation framework.
This chunk defines a method for selecting a cross-validation (CV)
approach randomly. First, we set a seed
(6102004) to ensure reproducibility of the
random sampling. Then, we define a list of four possible validation methods: a vanilla validation set, Leave-One-Out CV, 5-fold CV, and 10-fold CV.
Finally, the sample() function randomly
picks one method from the list and assigns it to
chosen_method, which is then printed.
set.seed(6102004) # Nikol Tushaj's bday
cv_methods <- c(
"1. Vanilla validation set",
"2. Leave-One-Out CV (LOO-CV)",
"3. K-fold CV (K = 5)",
"4. K-fold CV (K = 10)"
)
chosen_method <- sample(cv_methods, 1)
chosen_method
## [1] "2. Leave-One-Out CV (LOO-CV)"
Interpretation
In our case, the selected method was Leave-One-Out Cross-Validation (LOO-CV).
We decided to go with this method because it’s particularly effective when dealing with smaller datasets, like ours.
LOO-CV trains the model on all observations except one, using the excluded one for testing, and repeats this for each data point.
This helps us get an unbiased estimate of model performance without wasting data on a dedicated validation set. It’s computationally more intensive, but offers high reliability—especially useful for fine-tuning our models for salary level prediction.
In this chunk, we use the train()
function from the caret package to train a
multinomial logistic regression model using Leave-One-Out
Cross-Validation (LOOCV).
We first set up a control object
ctrl_loo specifying
"LOOCV" as the method, then pass it to the
train() function along with the model
formula and data.
This means the model is trained 238 times (once for each observation), each time leaving one observation out for testing and using the rest for training.
library(caret)
library(nnet)
set.seed(6102004)
ctrl_loo <- trainControl(method = "LOOCV")
loo_model <- train(
SalaryLevel ~ .,
data = model_data,
method = "multinom",
trControl = ctrl_loo,
trace = FALSE
)
print(loo_model)
## Penalized Multinomial Regression
##
## 238 samples
## 12 predictor
## 4 classes: 'Rookie', 'Starter', 'Pro', 'MVP'
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 237, 237, 237, 237, 237, 237, ...
## Resampling results across tuning parameters:
##
## decay Accuracy Kappa
## 0e+00 0.6428571 0.5229674
## 1e-04 0.6428571 0.5229674
## 1e-01 0.6596639 0.5456302
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was decay = 0.1.
Interpretation
LOOCV was chosen from a list of possible cross-validation methods using random sampling (seed set for reproducibility). Although it was selected randomly, LOOCV is actually a solid choice for our dataset because it’s relatively small (238 observations).
LOOCV helps reduce bias and gives a very thorough evaluation of model performance, which is useful when we want reliable accuracy without sacrificing too many observations for testing.
From the output, we can see that the best-performing model had a decay parameter of 0.1 and reached an accuracy of about 65.97% with a Kappa of 0.55, which indicates decent agreement between predicted and actual salary levels.
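The decay parameter tuned by caret is nnet’s L2 (weight-decay) penalty: larger values shrink the coefficients toward zero. A quick, hedged illustration on the built-in iris data (nnet ships with R; this is not the Hitters model):

```r
library(nnet)  # recommended package bundled with R
set.seed(1)
m0 <- multinom(Species ~ ., data = iris, decay = 0,   trace = FALSE)
m1 <- multinom(Species ~ ., data = iris, decay = 0.1, trace = FALSE)
# The penalized fit has a smaller overall coefficient norm:
sum(coef(m0)^2) > sum(coef(m1)^2)  # TRUE
```

This is why the decay = 0.1 row in the output above can outperform decay = 0: the mild shrinkage trades a little training fit for better generalization.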
This code uses the trained Leave-One-Out Cross-Validation (LOOCV)
multinomial model to predict the salary level for each player. It then
compares the predicted values to the actual values using a confusion
matrix (confusionMatrix() from the
caret package), which summarizes the
model’s classification performance.
library(caret)
preds <- predict(loo_model, newdata = model_data)
conf_mat <- confusionMatrix(preds, model_data$SalaryLevel)
acc <- round(conf_mat$overall["Accuracy"], 4)
kappa <- round(conf_mat$overall["Kappa"], 4)
cat("**Model Accuracy:**", acc, "\n")
## **Model Accuracy:** 0.7521
cat("**Cohen's Kappa:**", kappa, "\n")
## **Cohen's Kappa:** 0.6692
Interpretation
The model achieved an accuracy of 75.21%, meaning it correctly predicted salary levels in about three-quarters of the cases. Note that these predictions are made on the same data the final model was fit to, so this figure is more optimistic than the LOOCV estimate of about 66% reported above.
Additionally, a Cohen’s Kappa of 0.6692 suggests substantial agreement beyond chance, indicating that the model’s predictions are quite reliable for our classification task.
library(knitr)
library(kableExtra)
conf_df <- as.data.frame(conf_mat$table)
conf_wide <- tidyr::pivot_wider(
conf_df,
names_from = Reference,
values_from = Freq
)
kable(conf_wide, caption = "Confusion Matrix: Predicted vs Actual Salary Levels") %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
position = "center"
)
| Prediction | Rookie | Starter | Pro | MVP |
|---|---|---|---|---|
| Rookie | 54 | 7 | 0 | 0 |
| Starter | 3 | 41 | 13 | 1 |
| Pro | 3 | 11 | 40 | 9 |
| MVP | 0 | 0 | 12 | 44 |
Interpretation
The resulting table shows how our model’s predictions compare to the actual salary levels.
Most “Rookie” and “MVP” players were correctly classified, but there’s a bit more confusion between “Starter” and “Pro” — especially with 13 Pros being misclassified as Starters and 11 Starters as Pros.
This hints that those two groups might be harder to separate based on the features we used.
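As a sanity check, both headline metrics can be recomputed by hand from the confusion matrix above (rows are predictions, columns are actual levels):

```r
# Accuracy = observed agreement; Kappa = (observed - chance) / (1 - chance),
# where chance agreement comes from the row and column marginals.
cm <- matrix(c(54,  7,  0,  0,
                3, 41, 13,  1,
                3, 11, 40,  9,
                0,  0, 12, 44),
             nrow = 4, byrow = TRUE)
n  <- sum(cm)                               # 238 players
po <- sum(diag(cm)) / n                     # observed agreement (accuracy)
pe <- sum(rowSums(cm) * colSums(cm)) / n^2  # expected agreement by chance
c(Accuracy = round(po, 4),
  Kappa    = round((po - pe) / (1 - pe), 4))  # 0.7521 and 0.6692
```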
In this chunk, we calculated class-specific performance metrics from
the confusion matrix using
conf_mat$byClass.
After converting the results into a dataframe and extracting the class labels, we calculated Balanced Accuracy for each class manually using the average of Sensitivity and Specificity.
We then selected the most relevant columns
(Class,
Sensitivity,
Specificity,
Balanced Accuracy) and rounded the numeric
values to three decimals.
library(dplyr)
by_class <- as.data.frame(conf_mat$byClass)
by_class$Class <- rownames(conf_mat$byClass)
rownames(by_class) <- NULL
by_class$Balanced.Accuracy <- round((by_class$Sensitivity + by_class$Specificity) / 2, 3)
class_metrics <- by_class %>%
dplyr::select(Class, Sensitivity, Specificity, Balanced.Accuracy) %>%
mutate(across(where(is.numeric), ~ round(.x, 3)))
kable(class_metrics, caption = "Class-wise Performance Metrics") %>%
kable_styling(
full_width = FALSE,
bootstrap_options = c("striped", "hover", "condensed")
)
| Class | Sensitivity | Specificity | Balanced.Accuracy |
|---|---|---|---|
| Class: Rookie | 0.900 | 0.961 | 0.930 |
| Class: Starter | 0.695 | 0.905 | 0.800 |
| Class: Pro | 0.615 | 0.867 | 0.741 |
| Class: MVP | 0.815 | 0.935 | 0.875 |
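These class-wise figures can also be reproduced from scratch from the confusion matrix shown earlier, which makes the definitions explicit:

```r
# Rebuilding the table from the confusion matrix (rows = predicted,
# columns = actual): sensitivity = TP / actual positives, specificity =
# TN / actual negatives, balanced accuracy = their average.
cm <- matrix(c(54,  7,  0,  0,
                3, 41, 13,  1,
                3, 11, 40,  9,
                0,  0, 12, 44),
             nrow = 4, byrow = TRUE)
classes <- c("Rookie", "Starter", "Pro", "MVP")
n    <- sum(cm)
sens <- diag(cm) / colSums(cm)    # per-class recall
spec <- sapply(1:4, function(i) {
  fp <- sum(cm[i, ]) - cm[i, i]   # predicted class i, actually another class
  tn <- n - sum(cm[i, ]) - sum(cm[, i]) + cm[i, i]
  tn / (tn + fp)
})
setNames(round((sens + spec) / 2, 3), classes)  # 0.930 0.800 0.741 0.875
```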
Interpretation
The class-wise metrics confirm the pattern seen in the confusion matrix: Rookie is the easiest class to identify (balanced accuracy 0.930), followed by MVP (0.875), while the middle categories Starter (0.800) and especially Pro (0.741) are hardest to separate, consistent with the overlap between adjacent salary tiers.
model_data_full <- model_data
In this code, we applied stepwise model selection using AIC (Akaike Information Criterion) to find a more parsimonious version of our multinomial logistic regression model.
We started with a full model
(full_model_step) that included all
predictors in model_data_full, and then used
stepAIC() with
direction = "both" to allow both forward
selection and backward elimination.
This means variables could be added or removed iteratively based on which changes improved the AIC the most.
library(MASS)
library(nnet)
full_model_step <- multinom(SalaryLevel ~ ., data = model_data_full, trace = FALSE)
stepwise_model <- stepAIC(full_model_step, direction = "both", trace = FALSE)
summary(stepwise_model)
## Call:
## multinom(formula = SalaryLevel ~ AtBat + RBI + Walks + Years +
## CHits + PutOuts, data = model_data_full, trace = FALSE)
##
## Coefficients:
## (Intercept) AtBat RBI Walks Years CHits
## Starter 3.1465374 -1.674272 0.5392789 1.072940 0.2803077 4.075458
## Pro 2.7658801 -2.025021 0.6854214 1.967439 -1.2810930 8.222139
## MVP -0.3324148 -2.351867 1.4056373 2.259785 -2.9102031 12.765125
## PutOuts
## Starter -0.2032505
## Pro 0.1624575
## MVP 0.6621590
##
## Std. Errors:
## (Intercept) AtBat RBI Walks Years CHits PutOuts
## Starter 0.7397138 0.6499693 0.4282029 0.4738105 1.072957 1.032214 0.3500595
## Pro 0.7531370 0.7439332 0.5169125 0.5656089 1.204222 1.372753 0.4117279
## MVP 0.9928219 0.8427625 0.6098148 0.6118005 1.360912 1.814912 0.4757293
##
## Residual Deviance: 317.2372
## AIC: 359.2372
Interpretation
The goal of this step was to identify a simpler model that still captures the important relationships between predictors and salary level, while avoiding overfitting.
The summary output shows that AtBat, RBI, Walks, Years, CHits, and PutOuts were retained in the reduced model. These were likely the most informative predictors for distinguishing between the salary levels Rookie, Starter, Pro, and MVP.
Looking at the AIC value (359.24) and the residual deviance (317.24), this model fits the data reasonably well with fewer predictors than the full model. Overall, this step helped us reduce model complexity while maintaining strong explanatory power.
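As a quick consistency check (a hand calculation, not part of the original pipeline), the reported AIC follows directly from the deviance: AIC = deviance + 2k, where k counts all estimated coefficients — here 3 non-baseline classes × (6 predictors + 1 intercept) = 21.

```r
# AIC = residual deviance + 2 * (number of estimated coefficients)
k <- 3 * (6 + 1)   # 3 non-baseline classes x (6 predictors + intercept)
317.2372 + 2 * k   # 359.2372, matching the summary output above
```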
In this section, we’re evaluating the performance of our reduced multinomial logistic regression model—a simpler model derived through stepwise AIC selection. We’re applying Leave-One-Out Cross-Validation (LOOCV) to this reduced model to ensure a fair and consistent comparison with the full model.
library(caret)
library(nnet)
set.seed(6102004)
ctrl_loo <- trainControl(method = "LOOCV")
reduced_model <- train(
formula(stepwise_model),
data = model_data_full,
method = "multinom",
trControl = ctrl_loo,
trace = FALSE
)
print(reduced_model)
## Penalized Multinomial Regression
##
## 238 samples
## 6 predictor
## 4 classes: 'Rookie', 'Starter', 'Pro', 'MVP'
##
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation
## Summary of sample sizes: 237, 237, 237, 237, 237, 237, ...
## Resampling results across tuning parameters:
##
## decay Accuracy Kappa
## 0e+00 0.6680672 0.5564415
## 1e-04 0.6680672 0.5564415
## 1e-01 0.6512605 0.5338839
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was decay = 1e-04.
Interpretation
The reduced model used only 6 predictors and still performed well under LOOCV, achieving an accuracy of ~0.668 and a Cohen’s Kappa of ~0.556.
These figures are actually on par with — indeed slightly above — the full model’s LOOCV accuracy of ~0.660; the full model’s ~0.752 accuracy reported below is computed on the same data used to fit it, so it is not a like-for-like comparison.
In other words, stepwise selection removed less informative variables while retaining essentially all of the model’s out-of-sample strength, with the bonus of being simpler to interpret.
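For intuition, here is a hand-rolled sketch of what LOOCV does (the function name and arguments are our own; caret additionally tunes the decay penalty): fit the model n times, each time leaving one row out, and score the held-out prediction.

```r
library(nnet)

# Sketch of leave-one-out cross-validation for a multinomial model.
loocv_accuracy <- function(formula, data, response) {
  hits <- vapply(seq_len(nrow(data)), function(i) {
    fit  <- multinom(formula, data = data[-i, ], trace = FALSE)  # fit without row i
    pred <- predict(fit, newdata = data[i, , drop = FALSE])      # predict the held-out row
    as.character(pred) == as.character(data[[response]][i])
  }, logical(1))
  mean(hits)  # proportion of held-out rows classified correctly
}
```

Called as loocv_accuracy(formula(stepwise_model), model_data_full, "SalaryLevel"), this should land close to the ~0.668 figure above, at the cost of refitting the model 238 times.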
In this block, we’re evaluating the prediction performance of the stepwise-selected multinomial model.
We generate predictions on the full dataset using the reduced model, compute the confusion matrix comparing predicted vs actual salary levels, and format it nicely with kableExtra for display.
The goal is to assess how well the reduced model classifies players across the four salary categories (Rookie, Starter, Pro, MVP).
library(caret)
library(knitr)
library(kableExtra)
stepwise_preds <- predict(stepwise_model, newdata = model_data_full)
conf_mat_stepwise <- confusionMatrix(stepwise_preds, model_data_full$SalaryLevel)
conf_df <- as.data.frame(conf_mat_stepwise$table)
conf_wide <- tidyr::pivot_wider(conf_df,
names_from = Reference,
values_from = Freq)
kable(conf_wide, caption = "Confusion Matrix: Stepwise Model (Predicted vs Actual Salary Levels)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
position = "center")
| Prediction | Rookie | Starter | Pro | MVP |
|---|---|---|---|---|
| Rookie | 54 | 6 | 0 | 0 |
| Starter | 3 | 41 | 13 | 0 |
| Pro | 2 | 11 | 38 | 14 |
| MVP | 1 | 1 | 14 | 40 |
Interpretation
The confusion matrix shows strong performance for predicting Rookies and Starters, with 54 and 41 correctly classified, respectively.
However, there’s more confusion between Pro and MVP levels—especially MVPs being predicted as Pro (14 cases). This suggests the reduced model struggles to differentiate between higher-tier salary categories, possibly due to overlapping features or limited class separation in the data.
In this section, we compared the performance of the full and reduced multinomial logistic regression models by extracting their respective accuracy and Cohen’s Kappa scores from their confusion matrices.
These values were then stored in a data frame and displayed in a clean table format for easy comparison.
full_acc <- round(conf_mat$overall["Accuracy"], 4)
full_kappa <- round(conf_mat$overall["Kappa"], 4)
reduced_acc <- round(conf_mat_stepwise$overall["Accuracy"], 4)
reduced_kappa <- round(conf_mat_stepwise$overall["Kappa"], 4)
acc_kappa_df <- data.frame(
Model = c("Full Model", "Reduced Model"),
Accuracy = c(full_acc, reduced_acc),
Kappa = c(full_kappa, reduced_kappa)
)
library(knitr)
library(kableExtra)
kable(acc_kappa_df, caption = "Accuracy & Kappa: Full vs Reduced Model") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE,
position = "center")
| Model | Accuracy | Kappa |
|---|---|---|
| Full Model | 0.7521 | 0.6692 |
| Reduced Model | 0.7269 | 0.6354 |
Interpretation
The full model achieved slightly higher accuracy (0.7521) and Kappa (0.6692) than the reduced model (accuracy = 0.7269, Kappa = 0.6354).
Note that both sets of figures are computed on the same data used to fit the models, so they are optimistic estimates; the LOOCV results above give the fairer out-of-sample comparison.
Still, the modest gap suggests the extra predictors in the full model add only a little classification power, and the reduced model may be preferred when interpretability and efficiency are important.
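As a sanity check (hand arithmetic, not part of the pipeline), the reduced model’s Kappa can be reproduced from its confusion matrix shown earlier (rows = predicted, columns = actual): Kappa compares observed agreement with the agreement expected by chance from the marginal totals.

```r
cm <- matrix(c(54,  6,  0,  0,
                3, 41, 13,  0,
                2, 11, 38, 14,
                1,  1, 14, 40), nrow = 4, byrow = TRUE)
n  <- sum(cm)
po <- sum(diag(cm)) / n                     # observed agreement = accuracy (0.7269)
pe <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
(po - pe) / (1 - pe)                        # ~0.6354, the reported Kappa
```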
In this step, we compared the confusion matrices of the full and reduced models side by side using a heatmap visualization. The matrices were converted into data frames and labeled accordingly, then merged into one dataset for plotting.
library(ggplot2)
conf_full_df <- as.data.frame(conf_mat$table)
conf_reduced_df <- as.data.frame(conf_mat_stepwise$table)
conf_full_df$Model <- "Full"
conf_reduced_df$Model <- "Reduced"
conf_all <- rbind(conf_full_df, conf_reduced_df)
ggplot(conf_all, aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile(color = "white", width = 0.9, height = 0.9) +
geom_text(aes(label = Freq), color = "black", size = 5) +
scale_fill_gradient(low = "#ffb3d9", high = "#660066") +
facet_wrap(~Model, nrow = 1) +
labs(
title = "Confusion Matrices: Full vs Reduced Model",
x = "Actual", y = "Predicted"
) +
theme_minimal(base_size = 13) +
theme(
strip.text = element_text(face = "bold", size = 14),
plot.title = element_text(hjust = 0.5, size = 16),
axis.text = element_text(size = 12),
axis.title = element_text(size = 13)
)
Interpretation
The heatmaps make the comparison visual: correct classifications concentrate on the diagonal in both panels, and the small accuracy gap between the two models (0.7521 vs 0.7269) shows up as only minor differences in the off-diagonal cells.
The vanilla validation method refers to a simple train-test split, where the dataset is divided into two parts: one for training the model and the other for testing its performance. In our case, we used the caTools package to randomly split the data so that 80% is used for training (train_data) and the remaining 20% for testing (test_data). The split is stratified on the target variable SalaryLevel to preserve class proportions across the sets.
We chose this approach because it’s a straightforward and computationally light way to evaluate model performance, especially when dealing with reasonably sized datasets. This method allows us to simulate real-world prediction tasks by training on one portion of the data and testing on unseen data to assess generalization accuracy.
library(caTools)
set.seed(6102004)
split <- sample.split(model_data_full$SalaryLevel, SplitRatio = 0.8)
train_data <- subset(model_data_full, split == TRUE)
test_data <- subset(model_data_full, split == FALSE)
We are continuing our vanilla modeling process, applying a multinomial logistic regression to the training data from the 80/20 split. This code block trains the model on train_data and displays the model summary, which includes coefficient estimates, standard errors, residual deviance, and AIC.
library(nnet)
vanilla_model <- multinom(SalaryLevel ~ ., data = train_data, trace = FALSE)
summary(vanilla_model)
## Call:
## multinom(formula = SalaryLevel ~ ., data = train_data, trace = FALSE)
##
## Coefficients:
## (Intercept) AtBat Hits Runs RBI Walks Years
## Starter 3.1899758 -3.627942 1.0420699 0.6584629 1.406420 0.5519796 0.1594883
## Pro 2.4185526 -4.905749 2.4735754 0.4483905 1.408011 1.4665724 -1.5880947
## MVP 0.3240155 -3.223980 0.3038925 0.7428968 2.277004 1.7460073 -3.2610394
## CHits CHmRun DivisionWest PutOuts Errors NewLeague
## Starter 5.487762 -0.6177391 -0.8706095 -0.1544943 -0.4234180 0.6556809
## Pro 9.686622 -0.5235629 -0.9806671 0.1341253 -0.2609225 0.9321448
## MVP 14.725130 -0.5496685 -1.5136167 0.5480782 -0.4209294 0.2098092
##
## Std. Errors:
## (Intercept) AtBat Hits Runs RBI Walks Years
## Starter 1.721113 2.131758 1.768098 1.270791 0.9716359 0.8030795 1.511899
## Pro 1.867406 2.357700 1.958366 1.414882 1.1346763 0.8999374 1.646195
## MVP 2.138189 2.561209 2.309380 1.579893 1.2669626 0.9579582 1.854481
## CHits CHmRun DivisionWest PutOuts Errors NewLeague
## Starter 1.662908 0.8347009 0.8450726 0.4961481 0.5524199 0.9446413
## Pro 2.055991 1.0169626 0.9564956 0.5656962 0.6209299 1.0488414
## MVP 2.630358 1.2272514 1.0857020 0.6358958 0.6665705 1.1740851
##
## Residual Deviance: 224.9579
## AIC: 302.9579
Interpretation
The output shows how each predictor influences the odds of being classified in a higher salary category relative to the baseline (Rookie).
For example, players with more CHits, RBI, and Walks are associated with a higher probability of being in the MVP category, as indicated by their large positive coefficients.
The AIC value (302.96) and Residual Deviance (224.96) give us a sense of model fit — these can later be compared to the LOO and stepwise models.
In this chunk, we’re testing the vanilla multinomial regression model on the test data to evaluate its performance. The code uses the trained vanilla_model to generate predictions for the test set (vanilla_preds) and then compares these predictions with the actual salary levels using a confusion matrix (vanilla_conf_mat).
This helps us assess how well the model generalizes to unseen data, which is essential for understanding its real-world predictive ability.
library(caret)
vanilla_preds <- predict(vanilla_model, newdata = test_data)
vanilla_conf_mat <- confusionMatrix(vanilla_preds, test_data$SalaryLevel)
This code extracts and prints the accuracy and Cohen’s Kappa score from the vanilla model’s confusion matrix. It rounds both metrics to four decimal places for readability, then outputs them with cat(). These two metrics are crucial for evaluating the model’s overall classification performance and its agreement beyond chance.
acc_vanilla <- round(vanilla_conf_mat$overall["Accuracy"], 4)
kappa_vanilla <- round(vanilla_conf_mat$overall["Kappa"], 4)
cat("Vanilla Accuracy:", acc_vanilla, "\n")
## Vanilla Accuracy: 0.6458
cat("Vanilla Cohen's Kappa:", kappa_vanilla, "\n")
## Vanilla Cohen's Kappa: 0.5275
Interpretation
On the held-out 20% of players, the vanilla model reaches an accuracy of 0.6458 and a Kappa of 0.5275. Both figures sit below the LOOCV estimates reported earlier, which is unsurprising for a single small split: with only ~48 test observations, the estimate is noisier and more sensitive to which players happen to land in the test set.
In this chunk, we’re visualizing the performance of our vanilla (simple train/test split) model using a confusion matrix.
library(knitr)
library(kableExtra)
vanilla_table <- as.data.frame(vanilla_conf_mat$table)
vanilla_table_wide <- tidyr::pivot_wider(vanilla_table, names_from = Reference, values_from = Freq)
kable(vanilla_table_wide, caption = "Vanilla Validation: Confusion Matrix") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
| Prediction | Rookie | Starter | Pro | MVP |
|---|---|---|---|---|
| Rookie | 10 | 3 | 0 | 0 |
| Starter | 1 | 7 | 4 | 0 |
| Pro | 1 | 2 | 6 | 3 |
| MVP | 0 | 0 | 3 | 8 |
Interpretation
The confusion matrix reveals how well the vanilla model performed on the test set. Most classes are reasonably predicted — 10 of the 12 actual Rookies are classified correctly — but we observe notable misclassifications for “Starter” and “Pro”.
For example, actual Pros were frequently confused with “Starter” and “MVP”, indicating the model struggles with finer distinctions among the higher salary levels.
So while the vanilla model captures general patterns, its predictive accuracy still leaves room for improvement.
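The reported 0.6458 accuracy can be read straight off the matrix above: 31 of the 48 test players lie on the diagonal.

```r
vt <- matrix(c(10, 3, 0, 0,
                1, 7, 4, 0,
                1, 2, 6, 3,
                0, 0, 3, 8), nrow = 4, byrow = TRUE)
sum(diag(vt)) / sum(vt)   # 31/48 = 0.6458..., matching the reported vanilla accuracy
```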
In this part, we are creating a final visual comparison of the model performance metrics—Accuracy and Cohen’s Kappa—across three different models: Full LOO-CV, Reduced LOO-CV, and Vanilla validation. We first build a data frame that stores the metric values for each model, then reshape it into long format using pivot_longer() so it’s easier to plot with ggplot2.
The bar chart is constructed to display both Accuracy and Kappa side by side for each model. This allows us to directly visualize how model performance varies depending on the validation strategy and model complexity.
compare_df <- data.frame(
Model = c("Full LOO-CV", "Reduced LOO-CV", "Vanilla"),
Accuracy = c(0.6597, 0.6681, 0.6458),
Kappa = c(0.5456, 0.5564, 0.5275)
)
library(ggplot2)
library(tidyr)
compare_long <- pivot_longer(compare_df, cols = c("Accuracy", "Kappa"))
ggplot(compare_long, aes(x = Model, y = value, fill = name)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Model Comparison: Accuracy and Kappa",
y = "Metric Value", x = "", fill = "Metric") +
theme_minimal()
Interpretation
The bar plot shows that the Reduced model under LOO-CV actually edges out the Full model, with the highest Accuracy (~0.668 vs ~0.660) and Kappa (~0.556 vs ~0.546).
The gap is small, which suggests that trimming variables cost essentially nothing in out-of-sample performance while making the model simpler.
The Vanilla model, which uses a single 80/20 train-test split, shows the lowest metrics, especially for Kappa, indicating less consistent predictions across classes.
Overall, this comparison supports that more rigorous validation (LOO-CV) gives more stable performance estimates, and that the stepwise-reduced predictor set retains the model’s reliability.
library(caret)
reduced_preds <- predict(stepwise_model, newdata = model_data_full)
conf_mat_reduced <- confusionMatrix(reduced_preds, model_data_full$SalaryLevel)
In this section, we combine the confusion matrices from all three models—Full, Reduced (stepwise), and Vanilla—into a single dataset for visual comparison. Each matrix is first transformed into a data frame, and a new column called Model is added to distinguish them.
library(ggplot2)
library(dplyr)
library(tidyr)
full_cm_df <- as.data.frame(conf_mat$table) %>% mutate(Model = "Full")
reduced_cm_df <- as.data.frame(conf_mat_reduced$table) %>% mutate(Model = "Reduced")
vanilla_cm_df <- as.data.frame(vanilla_conf_mat$table) %>% mutate(Model = "Vanilla")
combined_cm <- bind_rows(full_cm_df, reduced_cm_df, vanilla_cm_df)
ggplot(combined_cm, aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile() +
geom_text(aes(label = Freq), color = "black", size = 3) +
facet_wrap(~Model, ncol = 3) +
scale_fill_gradient(low = "white", high = "steelblue") +
labs(title = "Confusion Matrices: Full vs Reduced vs Vanilla", fill = "Freq") +
theme_minimal()
Interpretation
The visualization shows the Full and Reduced models with similar, strongly diagonal patterns, with the Reduced model losing a little ground on the MVP class.
The Vanilla panel looks sparser, but note that the matrices are not on the same footing: the Full and Reduced matrices re-predict all 238 observations, while the Vanilla matrix covers only the ~48 held-out test players.
With that caveat, the shared pattern across panels is clear: Rookies are easy to classify, and most confusion concentrates between the adjacent Starter/Pro and Pro/MVP tiers.
We are choosing the Random Forest model because it offers a strong balance between predictive power and interpretability, especially in complex multiclass classification problems like ours. While logistic regression models (full, reduced, and vanilla) give us useful baseline insights and allow for variable significance testing, they rely on linear boundaries and can underperform when relationships between predictors and the target are non-linear or involve interactions. In contrast, Random Forest can capture these non-linearities and interactions naturally, without requiring extensive feature engineering.
Moreover, Random Forest is an ensemble method that builds multiple decision trees and averages their results, reducing variance and improving generalization—making it more robust to overfitting than single-tree models. It also provides useful variable importance metrics, helping us understand which features contribute most to predictions. Given our goal of accurately classifying players into salary levels based on a variety of performance metrics, Random Forest gives us a powerful, flexible, and relatively interpretable tool to boost classification performance beyond what standard regression models can offer.
In this section, we trained a Random Forest classifier on the training dataset (train_data) to predict the categorical SalaryLevel variable. The model uses 500 trees, with 3 predictors tried at each split — randomForest’s classification default of floor(sqrt(p)), which gives 3 for our 12 predictors.
library(randomForest)
rf_model <- randomForest(SalaryLevel ~ ., data = train_data, importance = TRUE)
print(rf_model)
##
## Call:
## randomForest(formula = SalaryLevel ~ ., data = train_data, importance = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 34.21%
## Confusion matrix:
## Rookie Starter Pro MVP class.error
## Rookie 41 4 3 0 0.1458333
## Starter 5 31 10 1 0.3404255
## Pro 0 15 26 11 0.5000000
## MVP 0 2 14 27 0.3720930
Interpretation
The OOB estimate of error rate is 34.21%, which means the model misclassifies around one-third of the observations on average when evaluated internally during training.
The confusion matrix shows that Rookie is the best predicted class with the lowest class error (around 14.6%), while Pro has the highest error rate at 50%, suggesting difficulty distinguishing Pro players.
The performance varies significantly across categories, which may indicate class imbalance or overlapping feature profiles. Despite its complexity, the Random Forest doesn’t drastically outperform simpler models here, but it still gives a different and potentially more generalizable perspective.
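The OOB figure quoted by print() is just the final value of the error curve the forest tracks as trees are added. A self-contained illustration on the built-in iris data (not our Hitters split) shows where that number lives:

```r
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris)
# err.rate has one row per tree; the "OOB" column is the running out-of-bag
# error, and its final value is the "OOB estimate of error rate" print() reports.
tail(rf$err.rate[, "OOB"], 1)
plot(rf$err.rate[, "OOB"], type = "l", xlab = "Trees", ylab = "OOB error")
```

Plotting this curve is a quick way to confirm that 500 trees is enough: the OOB error typically flattens out well before the last tree.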
rf_preds <- predict(rf_model, newdata = test_data)
library(caret)
rf_conf_mat <- confusionMatrix(rf_preds, test_data$SalaryLevel)
This part extracts the accuracy and Cohen’s Kappa score from the confusion matrix of the trained Random Forest model and prints them in a readable format.
acc_rf <- round(rf_conf_mat$overall["Accuracy"], 4)
kappa_rf <- round(rf_conf_mat$overall["Kappa"], 4)
cat("Random Forest Accuracy:", acc_rf, "\n")
## Random Forest Accuracy: 0.6875
cat("Random Forest Cohen's Kappa:", kappa_rf, "\n")
## Random Forest Cohen's Kappa: 0.5831
Interpretation
On the test set, the untuned Random Forest reaches an accuracy of 0.6875 and a Kappa of 0.5831 — the strongest test-set numbers so far, edging out the vanilla multinomial model (0.6458 / 0.5275) on the same split.
library(knitr)
library(kableExtra)
rf_table_not_tuned <- as.data.frame(rf_conf_mat$table)
rf_table_wide <- tidyr::pivot_wider(rf_table_not_tuned, names_from = Reference, values_from = Freq)
kable(rf_table_wide, caption = "Random Forest: Confusion Matrix") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
| Prediction | Rookie | Starter | Pro | MVP |
|---|---|---|---|---|
| Rookie | 11 | 4 | 0 | 0 |
| Starter | 0 | 7 | 3 | 0 |
| Pro | 1 | 1 | 7 | 3 |
| MVP | 0 | 0 | 3 | 8 |
Interpretation
The confusion matrix for the Random Forest model shows solid overall performance, especially for the Rookie and MVP classes. Of the 12 actual Rookies in the test set, 11 were correctly classified (the 4 in the Rookie row are actual Starters mis-predicted as Rookies).
The MVP class had 8 correct predictions out of 11, with 3 misclassified as Pro, suggesting some overlap among high-performing players. The Starter class shows 7 correct out of 12 actual Starters.
The Pro class is the most challenging, with only 7 correct out of 13 and noticeable confusion with MVP. Overall, the model handles the extreme salary levels well but has difficulty separating the mid-range categories Starter and Pro.
library(dplyr)
library(knitr)
library(kableExtra)
rf_by_class_not_tuned <- as.data.frame(rf_conf_mat$byClass)
rf_by_class_not_tuned$Class <- rownames(rf_conf_mat$byClass)
rownames(rf_by_class_not_tuned) <- NULL
rf_by_class_not_tuned$Class <- gsub("Class: ", "", rf_by_class_not_tuned$Class)
rf_by_class_not_tuned$Balanced.Accuracy <- round((rf_by_class_not_tuned$Sensitivity + rf_by_class_not_tuned$Specificity) / 2, 3)
rf_class_metrics_not_tuned <- rf_by_class_not_tuned %>%
dplyr::select(Class, Sensitivity, Specificity, Balanced.Accuracy) %>%
mutate(across(where(is.numeric), ~ round(.x, 3)))
kable(rf_class_metrics_not_tuned, caption = "Random Forest: Class-wise Performance Metrics") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
| Class | Sensitivity | Specificity | Balanced.Accuracy |
|---|---|---|---|
| Rookie | 0.917 | 0.889 | 0.903 |
| Starter | 0.583 | 0.917 | 0.750 |
| Pro | 0.538 | 0.857 | 0.698 |
| MVP | 0.727 | 0.919 | 0.823 |
Interpretation
The class-wise performance metrics for the Random Forest model confirm what we observed in the confusion matrix. The model performs best on the Rookie class, with a high sensitivity (0.917) and balanced accuracy (0.903), meaning it correctly identifies most Rookies while rarely confusing them with other classes.
MVP also shows strong results, with a balanced accuracy of 0.823. However, the model struggles more with the Starter and especially the Pro classes, which have lower sensitivities (0.583 and 0.538, respectively), indicating that many true Starters and Pros are being misclassified.
These mid-range categories are likely harder to distinguish due to overlapping performance metrics, highlighting a limitation of the model in separating similar salary levels.
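Each sensitivity above comes straight from the test-set confusion matrix: correct predictions for a class divided by that class’s column (actual) total. For Pro, whose column is 0, 3, 7, 3:

```r
# 13 actual Pros in the test set, 7 of them predicted correctly
7 / (0 + 3 + 7 + 3)   # 0.538..., the reported Pro sensitivity
```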
We are performing hyperparameter tuning on the Random Forest model to improve its predictive performance and generalization.
By using 10-fold cross-validation (cv) with tuneLength = 10, we let the model try different values of mtry (the number of variables considered at each split) and choose the one that yields the best accuracy. This helps reduce overfitting and ensures that the model performs well not just on the training data but also on unseen data. The code accomplishes this with the caret package’s train() function, specifying the tuning process through trainControl() and enabling feature importance output.
library(caret)
set.seed(123)
rf_tuned <- train(SalaryLevel ~ .,
data = train_data,
method = "rf",
trControl = trainControl(method = "cv", number = 10),
tuneLength = 10,
importance = TRUE)
rf_tuned$bestTune
## mtry
## 1 2
Interpretation
The output shows that the best mtry value found through cross-validation is 2, meaning the optimal number of predictors to consider at each split is 2. This value likely provided the best balance between bias and variance during training. Choosing a smaller mtry produces more diverse trees, which can help the ensemble generalize better.
plot(rf_tuned)
Interpretation
The plot shows the cross-validated accuracy of the Random Forest model across different values of mtry, the number of predictors randomly selected at each split. Accuracy peaks at mtry = 2, confirming the earlier result. As mtry increases beyond 2, accuracy gradually declines, indicating that considering more predictors per split introduces redundancy and reduces tree diversity, hurting generalization.
In this code, we evaluate the tuned Random Forest model on the test data. Predictions (rf_tuned_preds) are generated with predict(), and performance is assessed by comparing them to the actual salary levels in the test set using a confusion matrix.
rf_tuned_preds <- predict(rf_tuned, newdata = test_data)
conf_mat_rf_tuned <- confusionMatrix(rf_tuned_preds, test_data$SalaryLevel, mode = "everything")
In this code, we are comparing the class-wise performance metrics of the untuned and tuned Random Forest models.
We extract sensitivity, specificity, and balanced accuracy for each salary class from the respective confusion matrices.
These values are cleaned and labeled to reflect the model they come from. The shared columns between both models are then selected and the results are combined into a single dataframe.
Finally, the metrics are neatly displayed using a formatted table to assess how tuning the Random Forest impacted performance at the class level.
rf_class_metrics <- as.data.frame(rf_conf_mat$byClass)
rf_class_metrics$Class <- gsub("Class: ", "", rownames(rf_class_metrics))
rf_class_metrics$Model <- "RF Untuned"
rf_tuned_class_metrics <- as.data.frame(conf_mat_rf_tuned$byClass)
rf_tuned_class_metrics$Class <- gsub("Class: ", "", rownames(rf_tuned_class_metrics))
rf_tuned_class_metrics$Model <- "RF Tuned"
cols_to_use <- intersect(
c("Model", "Class", "Sensitivity", "Specificity", "Balanced Accuracy"),
intersect(colnames(rf_class_metrics), colnames(rf_tuned_class_metrics))
)
rf_class_metrics <- rf_class_metrics[, cols_to_use]
rf_tuned_class_metrics <- rf_tuned_class_metrics[, cols_to_use]
rf_comparison <- rbind(rf_class_metrics, rf_tuned_class_metrics)
rownames(rf_comparison) <- NULL  # drop the leaked "Class: ..." row names
kable(rf_comparison, caption = "Class-wise Performance: RF Untuned vs RF Tuned") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
| Model | Class | Sensitivity | Specificity | Balanced Accuracy |
|---|---|---|---|---|
| RF Untuned | Rookie | 0.9166667 | 0.8888889 | 0.9027778 |
| RF Untuned | Starter | 0.5833333 | 0.9166667 | 0.7500000 |
| RF Untuned | Pro | 0.5384615 | 0.8571429 | 0.6978022 |
| RF Untuned | MVP | 0.7272727 | 0.9189189 | 0.8230958 |
| RF Tuned | Rookie | 0.9166667 | 0.8888889 | 0.9027778 |
| RF Tuned | Starter | 0.5833333 | 0.9444444 | 0.7638889 |
| RF Tuned | Pro | 0.6923077 | 0.8857143 | 0.7890110 |
| RF Tuned | MVP | 0.8181818 | 0.9459459 | 0.8820639 |
Interpretation
From the output, we observe that tuning the Random Forest model generally improved class-wise performance, especially for the “Pro” and “MVP” categories.
For instance, the sensitivity of the “Pro” class increased from 0.538 to 0.692, and “MVP” rose from 0.727 to 0.818. Balanced accuracy improved for every class except Rookie, which was unchanged, with the “MVP” class reaching an impressive 0.88.
These gains suggest that tuning enhanced the model’s ability to correctly identify the harder classes, boosting its overall robustness.
In this code chunk, we display the confusion matrix for the tuned Random Forest model using kable. This format makes it easy to see where the model made correct or incorrect predictions; the styled table is rendered with the kableExtra package for better visual presentation.
library(knitr)
library(kableExtra)
rf_table <- as.data.frame(conf_mat_rf_tuned$table)
rf_table_wide <- tidyr::pivot_wider(rf_table, names_from = Reference, values_from = Freq)
kable(rf_table_wide, caption = "Random Forest Tuned: Confusion Matrix") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
| Prediction | Rookie | Starter | Pro | MVP |
|---|---|---|---|---|
| Rookie | 11 | 4 | 0 | 0 |
| Starter | 0 | 7 | 2 | 0 |
| Pro | 1 | 1 | 9 | 2 |
| MVP | 0 | 0 | 2 | 9 |
Interpretation
The resulting confusion matrix for the tuned Random Forest model shows that most predictions fall along the diagonal, indicating correct classifications.
For instance, 9 of the 11 actual MVPs and 11 of the 12 actual Rookies were correctly predicted. A few misclassifications remain, such as 4 actual Starters predicted as Rookies and some Pro players mistaken for MVPs or Starters.
Overall, the tuned model performs well across classes, especially in correctly identifying MVPs and Rookies.
In this section, we aim to extract and summarize the class-wise performance metrics from the tuned Random Forest model.
Specifically, we gather values for Sensitivity, Specificity, and Balanced Accuracy for each salary class, round them for clarity, and organize the results into a readable table format.
library(dplyr)
library(knitr)
library(kableExtra)
rf_by_class <- as.data.frame(conf_mat_rf_tuned$byClass)
rf_by_class$Class <- rownames(conf_mat_rf_tuned$byClass)
rownames(rf_by_class) <- NULL
rf_by_class$Class <- gsub("Class: ", "", rf_by_class$Class)
rf_by_class$Balanced.Accuracy <- round((rf_by_class$Sensitivity + rf_by_class$Specificity) / 2, 3)
rf_class_metrics <- rf_by_class %>%
dplyr::select(Class, Sensitivity, Specificity, Balanced.Accuracy) %>%
mutate(across(where(is.numeric), ~ round(.x, 3)))
kable(rf_class_metrics, caption = "Random Forest Tuned: Class-wise Performance Metrics") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
| Class | Sensitivity | Specificity | Balanced.Accuracy |
|---|---|---|---|
| Rookie | 0.917 | 0.889 | 0.903 |
| Starter | 0.583 | 0.944 | 0.764 |
| Pro | 0.692 | 0.886 | 0.789 |
| MVP | 0.818 | 0.946 | 0.882 |
Interpretation
From the results, we observe that the tuned Random Forest model performs exceptionally well on the “Rookie” and “MVP” classes, with balanced accuracy values of 0.903 and 0.882, respectively.
The “Starter” class has the lowest balanced accuracy at 0.764, indicating more difficulty in correctly identifying instances in that category.
Overall, this detailed class-wise evaluation confirms that tuning has improved the model’s performance, especially for the more complex classes like “Pro” and “MVP”.
rf_cm_df <- as.data.frame(rf_conf_mat$table) %>%
mutate(Model = "Random Forest")
combined_cm <- bind_rows(full_cm_df, reduced_cm_df, vanilla_cm_df, rf_cm_df)
library(ggplot2)
ggplot(combined_cm, aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile() +
geom_text(aes(label = Freq), color = "black") +
facet_wrap(~ Model) +
scale_fill_gradient(low = "white", high = "steelblue") +
labs(title = "Confusion Matrices: All Models Compared",
fill = "Frequency") +
theme_minimal()
Interpretation
The comparison of confusion matrices across all models reveals key insights into their relative performance, with one caveat: the Full and Reduced matrices re-predict all 238 observations, while the Vanilla and Random Forest matrices are computed on the ~48-player test set, so raw counts are not directly comparable across panels.
The Full and Reduced models show strong diagonal concentration, especially for the Rookie class, with their usual confusion between adjacent tiers.
On the shared test set, the Random Forest distributes errors more evenly than the Vanilla model, captures MVPs better, and reduces confusion between “Pro” and “Starter”, a common issue in the other models.
Overall, the comparison suggests that Random Forest improves class-level precision over the vanilla multinomial fit, while the apparent dominance of the Full and Reduced panels partly reflects their in-sample evaluation.
We chose to include K-Nearest Neighbors (KNN) in our model comparison because it’s a simple, intuitive, and non-parametric method that classifies observations based on the majority class among the closest training examples.
Unlike models that assume linear relationships (like logistic regression), KNN captures local patterns in the data, which can be especially useful in a multi-class setting like ours. It also helps us evaluate how well a distance-based approach performs compared to more complex ensemble models like Random Forest, offering valuable contrast in terms of both methodology and outcomes.
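A minimal hand-rolled sketch of this majority-vote rule (the toy points and function name are our own illustration, not caret’s implementation):

```r
# Classify a new point by the majority class among its k nearest training points.
knn_predict <- function(train_x, train_y, new_x, k = 3) {
  d <- sqrt(rowSums((train_x - matrix(new_x, nrow(train_x), ncol(train_x),
                                      byrow = TRUE))^2))  # Euclidean distances
  neighbors <- train_y[order(d)[1:k]]                     # labels of the k closest
  names(which.max(table(neighbors)))                      # majority vote
}

train_x <- matrix(c(1, 1,  1, 2,  5, 5,  6, 5), ncol = 2, byrow = TRUE)
train_y <- c("Rookie", "Rookie", "MVP", "MVP")
knn_predict(train_x, train_y, c(1.5, 1.5), k = 3)   # "Rookie"
```

With k = 3, two of the three nearest toy points are Rookies, so the vote returns "Rookie". Because the vote is distance-based, features on larger scales dominate unless the data are standardized first.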
This code prepares for KNN modeling using the
caret package. We set a random seed for
reproducibility and configure 10-fold cross-validation to evaluate model
performance. The settings are stored in
knn_ctrl for later use.
library(caret)
set.seed(123)
knn_ctrl <- trainControl(method = "cv", number = 10)
Here we train our KNN model to predict salary levels
(SalaryLevel) using all the other
variables in our training data. We’re using the settings we made earlier
for 10-fold cross-validation (knn_ctrl).
The tuneLength = 10 part means the model
will automatically try 10 different values for k (the number of
neighbors) to find which works best. The trained model gets saved as
knn_model so we can check how well it
performs later.
knn_model <- train(SalaryLevel ~ .,
                   data = train_data,
                   method = "knn",
                   trControl = knn_ctrl,
                   tuneLength = 10)
knn_model$bestTune
## k
## 10 23
Interpretation
The output shows that our KNN model chose k=23 neighbors as the best option after testing different values. This means when making predictions, the model looks at the 23 most similar data points in our training set to decide what salary level to predict.
A k-value of 23 suggests our data might be pretty complex - it needs to consider quite a few neighbors to make good predictions rather than just looking at the very closest ones.
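One design choice worth flagging: the train() call above feeds KNN the predictors on their original scales. A variant that standardizes them first is sketched below, under the assumption that train_data and knn_ctrl are as defined above; preProcess is caret's built-in preprocessing hook, but we did not run this version, so the object name knn_model_scaled and any resulting performance change are hypothetical.

```r
# Hypothetical variant (not run in this analysis): KNN votes by raw Euclidean
# distance, so predictors with large scales (e.g. career totals such as CHits)
# can dominate. caret's preProcess argument applies centering and scaling
# inside each cross-validation fold.
knn_model_scaled <- train(SalaryLevel ~ .,
                          data = train_data,
                          method = "knn",
                          preProcess = c("center", "scale"),
                          trControl = knn_ctrl,
                          tuneLength = 10)
```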
plot(knn_model)
Interpretation
The graph shows how cross-validated accuracy changes as we try different numbers of neighbors (k). Accuracy fluctuates across the candidate values and reaches its maximum at k=23, consistent with the value reported by bestTune.
What this means:
Best k-value: k=23 gives the highest cross-validated accuracy among the values tested.
Performance: Even the best cross-validated accuracy is modest. With four salary classes, random guessing would land around 25%, so the model clearly learns something, but its predictions are far from reliable.
Here we are making predictions on the salary level using KNN on the test data.
knn_preds <- predict(knn_model, newdata = test_data)
Here we are creating a confusion matrix with all the metrics of KNN’s performance.
conf_mat_knn <- confusionMatrix(knn_preds, test_data$SalaryLevel, mode = "everything")
Here we’re summarizing how the KNN model performed for each salary class by calculating sensitivity, specificity, and balanced accuracy. This helps us see how well the model is able to correctly identify each class, not just its overall accuracy.
library(dplyr)
library(knitr)
library(kableExtra)
knn_by_class <- as.data.frame(conf_mat_knn$byClass)
knn_by_class$Class <- rownames(conf_mat_knn$byClass)
rownames(knn_by_class) <- NULL
knn_by_class$Class <- gsub("Class: ", "", knn_by_class$Class)
knn_by_class$Balanced.Accuracy <- round((knn_by_class$Sensitivity +
knn_by_class$Specificity) / 2, 3)
knn_class_metrics <- knn_by_class %>%
dplyr::select(Class, Sensitivity, Specificity, Balanced.Accuracy) %>%
mutate(across(where(is.numeric), ~ round(.x, 3)))
kable(knn_class_metrics, caption = "KNN: Class-wise Performance Metrics") %>%
kable_styling(full_width = FALSE,
bootstrap_options = c("striped", "hover", "condensed"))
| Class | Sensitivity | Specificity | Balanced.Accuracy |
|---|---|---|---|
| Rookie | 0.833 | 0.917 | 0.875 |
| Starter | 0.333 | 0.889 | 0.611 |
| Pro | 0.538 | 0.657 | 0.598 |
| MVP | 0.455 | 0.919 | 0.687 |
Interpretation
The class-wise performance results for the KNN model show that it performs best at identifying “Rookie” players, with a high balanced accuracy of 0.875.
However, it struggles with the “Starter” and “Pro” categories, whose balanced accuracies hover around 0.6 (0.611 and 0.598), indicating poor classification reliability for those groups.
The “MVP” class performs slightly better, but still falls behind the performance seen in Random Forest or logistic models.
Overall, KNN appears to be less effective for this dataset, especially for distinguishing between middle-range salary levels where players may have more overlapping characteristics.
Here we are displaying the confusion matrix for KNN in a clean format so we can easily see how many observations were correctly or incorrectly predicted for each salary level.
library(knitr)
library(kableExtra)
library(tidyr)
knn_table <- as.data.frame(conf_mat_knn$table)
knn_table_wide <- pivot_wider(knn_table,
names_from = Reference,
values_from = Freq)
kable(knn_table_wide, caption = "KNN: Confusion Matrix") %>%
kable_styling(full_width = FALSE,
bootstrap_options = c("striped", "hover", "condensed"))
| Prediction | Rookie | Starter | Pro | MVP |
|---|---|---|---|---|
| Rookie | 10 | 3 | 0 | 0 |
| Starter | 0 | 4 | 4 | 0 |
| Pro | 2 | 4 | 7 | 6 |
| MVP | 0 | 1 | 2 | 5 |
Interpretation
The confusion matrix shows that KNN performs quite well when predicting “Rookie” players, correctly classifying 10 of the 12 actual Rookies.
However, the model struggles with the other classes. “Starter” is often confused with both “Rookie” and “Pro”, and “Pro” predictions are widely spread, with only 7 of the 13 actual Pros classified correctly. The “MVP” class also sees considerable confusion with “Pro”: 6 of the 11 MVPs were misclassified as Pros.
In general, the KNN model lacks consistency in distinguishing between mid to high salary levels, suggesting it may not be the best fit for the structure and complexity of this dataset.
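As a sanity check linking the two tables above, the class-wise metrics can be recomputed directly from the confusion-matrix counts (the matrix below simply re-enters those counts by hand; sensitivity is TP over actual positives, specificity is TN over actual negatives):

```r
# KNN confusion-matrix counts from the table above
# (rows = Prediction, columns = Reference)
cm <- matrix(c(10, 0, 2, 0,    # actual Rookie
               3, 4, 4, 1,     # actual Starter
               0, 4, 7, 2,     # actual Pro
               0, 0, 6, 5),    # actual MVP
             nrow = 4,
             dimnames = list(Prediction = c("Rookie", "Starter", "Pro", "MVP"),
                             Reference  = c("Rookie", "Starter", "Pro", "MVP")))

sens <- diag(cm) / colSums(cm)                 # TP / actual positives
spec <- sapply(1:4, function(i)
  sum(cm[-i, -i]) / sum(cm[, -i]))             # TN / actual negatives
round((sens + spec) / 2, 3)                    # 0.875, 0.611, 0.598, 0.687 - matches the table
```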
In this code chunk, we are building a combined table to compare the class-wise performance of three models: the tuned Random Forest, untuned Random Forest, and k-NN. For each model, we collect the key performance metrics — Sensitivity, Specificity, and Balanced Accuracy — and label them with the corresponding model name. These tables are then merged into one, which is printed using a styled table to visualize and directly compare how well each model performs across different salary classes.
library(dplyr)
library(kableExtra)
rf_class_metrics$Model <- "RF Tuned"
rf_class_metrics_not_tuned$Model <- "RF Untuned"
knn_class_metrics$Model <- "k-NN"
colnames(rf_class_metrics)[colnames(rf_class_metrics) == "Balanced.Accuracy"] <- "Balanced Accuracy"
colnames(rf_class_metrics_not_tuned)[colnames(rf_class_metrics_not_tuned) == "Balanced.Accuracy"] <- "Balanced Accuracy"
colnames(knn_class_metrics)[colnames(knn_class_metrics) == "Balanced.Accuracy"] <- "Balanced Accuracy"
cols_to_use <- c("Model", "Class", "Sensitivity", "Specificity", "Balanced Accuracy")
model_comparison <- bind_rows(
rf_class_metrics[, cols_to_use],
rf_class_metrics_not_tuned[, cols_to_use],
knn_class_metrics[, cols_to_use]
)
kable(model_comparison, caption = "Class-wise Performance Comparison: RF Tuned vs RF Untuned vs k-NN") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
| Model | Class | Sensitivity | Specificity | Balanced Accuracy |
|---|---|---|---|---|
| RF Tuned | Rookie | 0.917 | 0.889 | 0.903 |
| RF Tuned | Starter | 0.583 | 0.944 | 0.764 |
| RF Tuned | Pro | 0.692 | 0.886 | 0.789 |
| RF Tuned | MVP | 0.818 | 0.946 | 0.882 |
| RF Untuned | Rookie | 0.917 | 0.889 | 0.903 |
| RF Untuned | Starter | 0.583 | 0.917 | 0.750 |
| RF Untuned | Pro | 0.538 | 0.857 | 0.698 |
| RF Untuned | MVP | 0.727 | 0.919 | 0.823 |
| k-NN | Rookie | 0.833 | 0.917 | 0.875 |
| k-NN | Starter | 0.333 | 0.889 | 0.611 |
| k-NN | Pro | 0.538 | 0.657 | 0.598 |
| k-NN | MVP | 0.455 | 0.919 | 0.687 |
Interpretation
The comparison table shows the tuned Random Forest leading in every class: its balanced accuracies range from 0.764 (Starter) to 0.903 (Rookie). The untuned forest trails it slightly, most visibly for Pro (0.698 vs 0.789), while k-NN falls well behind on the middle classes, with Starter at 0.611 and Pro at 0.598.
Next, we collect and compare the overall performance metrics, Accuracy and Cohen’s Kappa, for the three models: Random Forest Tuned, Random Forest Untuned, and k-Nearest Neighbors (k-NN). The values are extracted from each model’s confusion matrix and organized into a single summary table for easy comparison.
overall_metrics <- data.frame(
Model = c("RF Tuned", "RF Untuned", "k-NN"),
Accuracy = c(
round(conf_mat_rf_tuned$overall["Accuracy"], 4),
round(rf_conf_mat$overall["Accuracy"], 4),
round(conf_mat_knn$overall["Accuracy"], 4)
),
Kappa = c(
round(conf_mat_rf_tuned$overall["Kappa"], 4),
round(rf_conf_mat$overall["Kappa"], 4),
round(conf_mat_knn$overall["Kappa"], 4)
)
)
kable(overall_metrics, caption = "Overall Accuracy & Kappa: RF Tuned vs RF Untuned vs k-NN") %>%
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover", "condensed"))
| Model | Accuracy | Kappa |
|---|---|---|
| RF Tuned | 0.7500 | 0.6663 |
| RF Untuned | 0.6875 | 0.5831 |
| k-NN | 0.5417 | 0.3850 |
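Cohen’s kappa corrects raw accuracy for chance agreement: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed accuracy and p_e the accuracy expected by chance from the row and column margins. As an illustrative check, the k-NN value in the table can be recomputed from the confusion-matrix counts shown earlier:

```r
# k-NN confusion-matrix counts (rows = Prediction, columns = Reference)
cm <- matrix(c(10, 0, 2, 0,
               3, 4, 4, 1,
               0, 4, 7, 2,
               0, 0, 6, 5), nrow = 4)
n   <- sum(cm)
p_o <- sum(diag(cm)) / n                      # observed agreement (= accuracy, 26/48)
p_e <- sum(rowSums(cm) * colSums(cm)) / n^2   # chance agreement from the margins
round((p_o - p_e) / (1 - p_e), 4)             # 0.385, matching the table's Kappa
```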
Interpretation
In this project, we set out to predict player salary levels using the Hitters dataset by applying and comparing multiple classification models. After cleaning the data and preparing the variables, we explored multinomial logistic regression, Random Forest (both tuned and untuned), and k-nearest neighbors (k-NN). The focus wasn’t just on building models but also on understanding their performance using meaningful evaluation metrics like accuracy, Cohen’s kappa, and class-wise sensitivity, specificity, and balanced accuracy.
From our results, the tuned Random Forest consistently performed the best. It achieved the highest overall accuracy (75%) and Cohen’s kappa (0.6663), suggesting strong agreement between the predicted and actual salary classes. It also delivered strong class-wise performance, especially for higher-level salary classes like Pro and MVP. The untuned Random Forest still performed fairly well but was slightly behind the tuned version in all metrics. On the other hand, k-NN showed the weakest performance, with an accuracy of about 54% and a much lower kappa score (0.3850), particularly struggling with distinguishing higher salary classes.
Overall, this comparison highlights the value of tuning machine learning models and choosing algorithms that can handle class imbalance effectively. While simpler models like k-NN are easy to implement, they may not always perform well in more complex classification tasks. The Random Forest, especially when tuned, showed the best balance between accuracy and generalization, making it the most reliable model for this dataset.
Our group had some assistance from OpenAI for debugging and occasional help with understanding outputs or cleaning up explanations.